python - Parsing a huge structured file in Python 2.7

Question

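I'm new to the Python world and to bioinformatics. I'm processing a structured file of almost 50 GB and writing out results from it, so I'd appreciate some good tips.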

The file looks like this. (The format is actually called FASTQ_format.)

@Machinename:~:Team1:atcatg   1st line.
atatgacatgacatgaca            2nd line.       
+                             3rd line.           
asldjfwe!@#$#%$               4th line.

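These four lines repeat in order; each group of 4 lines is like one team. I also have nearly 30 candidate DNA sequences, e.g. atgcat, tttagc.

What I'm doing is running each candidate DNA sequence through the huge file to find whether it is similar to a team's DNA sequence, where each candidate sequence is allowed one mismatch (e.g. taaaaa = aaaata). If they are similar or identical, I store the record in a dictionary so I can write it out later. The key is the candidate DNA sequence; the value is a list holding the 4 lines, stored in line order.

So what I've done is: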

    def myfunction(str1, str2):
        # Returns True if the two sequences are similar (one mismatch allowed).
        pass  # body not shown

    f = open('hugefile')
    diction = {}
    mylist = ['candidate dna sequences1','dna2','dna3','dna4'...]
    while True:
        line = f.readline()
        if not line:
            break
        if "machine name" in line:
            teamseq = line.split(':')[-1].strip()
            # The header line plus the next three lines form one 4-line record.
            record = [line, f.readline(), f.readline(), f.readline()]
            for candidate in mylist:
                if myfunction(candidate, teamseq):
                    # The same team DNA may be repeated, so keep appending.
                    diction.setdefault(candidate, []).extend(record)
    f.close()

    wf = open('hugefile' + ".out", 'w')
    for candidate in mylist:   # dna1, dna2, dna3
        wf.write(''.join(diction.get(candidate, [])))
    wf.close()

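My function doesn't use any global variables (I think I'm happy with my function), whereas the dictionary is a global variable that takes in all the data and creates a lot of list instances. The code is simple, but it is very slow and a huge pain for the CPU and memory. I am using pypy, though.

So, are there any tips for writing the records out in line order?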

score 1 · Accepted Answer

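I suggest opening the input and output files at the same time and writing to the output as you step through the input. As it stands, you are reading 50 GB into memory and then writing it out. That is both slow and unnecessary.

In pseudocode: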

    with open('hugefile') as fin, open('hugefile' + ".out", 'w') as fout:
        for line in fin:
            if "machine name" in line:
                # read the following 3 lines from fin to complete the 4-line record
                # process that record
                # write the record to fout
                # the input record is no longer needed -- allow it to be garbage collected...

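As I have outlined it, each 4-line record is written as it is encountered and then disposed of. If you need to refer to diction.keys() for previous records, keep only the minimum necessary, as a set(), to cut down the total size of the data held in memory.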
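
The outline above could be fleshed out roughly as follows. This is only a sketch under stated assumptions, not the asker's actual code: `filter_fastq` and `at_most_one_mismatch` are hypothetical names, the comparison is assumed to mean "same length, at most one differing position", and a record is assumed to start at a line beginning with `@` whose last `:`-separated field is the team sequence.

```python
def at_most_one_mismatch(a, b):
    """Hypothetical stand-in for the asker's comparison function.

    Assumes "similar" means equal length with at most one differing position.
    """
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1


def filter_fastq(in_path, out_path, candidates, is_similar=at_most_one_mismatch):
    """Stream 4-line records from in_path and write matching ones to out_path.

    Only one record is held in memory at a time. Returns the set of
    candidates that matched at least one record.
    """
    matched = set()
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for header in fin:
            if not header.startswith('@'):
                continue  # skip blank or stray lines between records
            # The header plus the next three lines form one record.
            record = [header, next(fin, ''), next(fin, ''), next(fin, '')]
            teamseq = header.rstrip('\n').split(':')[-1]
            for candidate in candidates:
                if is_similar(candidate, teamseq):
                    matched.add(candidate)
                    fout.write(''.join(record))
                    break  # write each record once, on the first match
    return matched
```

A call might look like `filter_fastq('hugefile', 'hugefile.out', mylist)`. Records come out in input order rather than grouped by candidate; if grouping matters, one option is to keep one open output file per candidate instead of a single `fout`.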
