python - 如何创建索引来解析大文本文件

Question

我有两个 FASTQ 格式的文件 A 和 B，它们基本上是以 @ 开头的 4 行一组的几亿行文本，如下所示：

@120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB

我需要比较

5:1101:1156:2031#0/

文件 A 和 B 之间的部分，并在文件 B 中写入与新文件匹配的 4 行组。我在python中有一段代码可以做到这一点，但仅适用于小文件，因为它为文件A中的每个@-line解析文件B的整个@-行，并且两个文件都包含数亿行。

有人建议我应该为文件 B 创建一个索引；我在谷歌上四处搜索但没有成功，如果有人能指出如何做到这一点或让我知道一个教程以便我可以学习，我将不胜感激。谢谢。

==EDIT==理论上每组4行应该在每个文件中只存在一次。如果在每次比赛后打破解析，它会提高速度还是完全需要不同的算法？

score 1 · Accepted Answer

这些家伙声称在使用专用库时解析了一些 gigs 文件，请参阅http://www.biostars.org/p/15113/

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")

更好的方法 IMO 是解析一次并将数据加载到某个数据库而不是那个output_file（即 mysql），然后在那里运行查询

score 1 · Accepted Answer

索引只是您正在使用的信息的缩短版本。在这种情况下，您将需要“键”——@ 行上的第一个冒号（':'）和末尾附近的最后一个斜杠（'/'）之间的文本——以及某种值。

由于本例中的“值”是 4 行块的全部内容，并且由于我们的索引将为每个块存储一个单独的条目，所以如果我们使用实际值，我们会将整个文件存储在内存中索引。

相反，让我们使用 4 行块开头的文件位置。这样，您可以移动到该文件位置，打印 4 行，然后停止。总成本是存储整数文件位置所需的 4 或 8 个字节或多个字节，而不是实际基因组数据的多个字节。

这是一些完成这项工作的代码，但也做了很多验证和检查。你可能想扔掉你不使用的东西。

import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates- use first occurrence.
            index[key] = pos

    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('@'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:     # @machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [ line ]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:   # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:           # lx == 3 : quality data
            data.append(line)
            len2 = len(line)

            if len2 != len1:
                error("Data length mismatch at line "
                        + str(i-2)
                        + " (len: " + str(len1) + ") and line "
                        + str(i)
                        + " (len: " + str(len2) + ")\n")
            #print "Yielding @%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset

    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i));

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)

    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')

    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())

    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [ index[k] for k in matches ]
write_matches('afile.fastq', posns, 'outfile.fastq')

请注意，此代码返回到第一个文件以获取数据块。如果文件之间的数据相同，则可以在发生匹配时从第二个文件复制块。

另请注意，根据您尝试提取的内容，您可能希望更改输出块的顺序，并且您可能希望确保键是唯一的，或者可能确保键不唯一但在它们匹配的顺序。这取决于你——我不确定你在用数据做什么。

python - 如何创建索引来解析大文本文件

2 回答 2

Related

Reference