I'm new to Python, so I apologize if this example is trivial.

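I'm trying to write a simple script that pastes together and extracts parts of two large data files (about 40 GB each) into a single result file with a slightly changed format. I first tried readlines(), but it reads the whole file into memory, and our instance has only 28 GB of RAM. Using the sizehint argument only parses part of the file.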
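
A note on sizehint: readlines(sizehint) returns only roughly that much data as complete lines per call, so covering the whole file means calling it repeatedly until it returns an empty list. Below is a minimal sketch of that loop, written for Python 3. The helper name iter_batches and the 64 MB default are made up for illustration.

```python
import io

def iter_batches(fileobj, sizehint=64 * 1024 * 1024):
    """Yield lists of complete lines, each list roughly `sizehint` in size."""
    while True:
        batch = fileobj.readlines(sizehint)
        if not batch:  # an empty list means end of file
            break
        yield batch

# Small in-memory demo; a real run would pass an open file instead.
demo = io.StringIO("@r1\nACGT\n+\nAAAA\n" * 3)
lines = [line for batch in iter_batches(demo, sizehint=8) for line in batch]
```

A batch boundary can fall in the middle of a record, so plain iteration over the file is usually simpler for this kind of parsing.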

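I'm now iterating over the files instead. The problem is that I store the output of the text parsing in three lists, which grow quite large and exceed the available memory. I assumed it would simply start using swap, which would be fine, but instead it exits with a "MemoryError".

This works on small sample files but fails on our real data.

Script: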

import sys

a = []
b = []
c = []

file1 = open(sys.argv[1],"r")
for line in file1:
    if '@' in line:
        a.append(line.lstrip('@').rstrip('\n'))
        b.append(file1.next().rstrip('\n'))
file1.close()

file2 = open(sys.argv[2],"r")
for line in file2:
    if '@' in line: 
        c.append(file2.next().rstrip('\n'))
file2.close()

file3 = open(sys.argv[3],"w")
for i in xrange(len(a)):
    file3.write("".join([">",a[i],'\n',b[i],":",c[i],"\n"]))

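The suggestions I found online involve creating some kind of database to store the variables, which shouldn't be necessary here. Do you have any ideas on how I should approach this?

For completeness, this is what I'm trying to do (from our sample test data):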

file1: 

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file2:

@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file3 (output):

>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT

3 Answers

Can you write to the output file as you parse the input files, rather than parsing the files into arrays (a, b, c)?

Something like this:

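import sys

def next_record(f):
    # Advance to the next '@' header; return (header, sequence), or None at EOF.
    for line in f:
        if line.startswith('@'):
            # The sequence is the line right after the header.
            return line[1:].rstrip('\n'), next(f, '').rstrip('\n')
    return None

# Two inputs and one output, as in the question
in1 = open(sys.argv[1])
in2 = open(sys.argv[2])
out = open(sys.argv[3], 'w')

while True:
    rec1 = next_record(in1)
    rec2 = next_record(in2)
    # Stop as soon as either input is exhausted
    if rec1 is None or rec2 is None:
        break
    out.write(">%s\n%s:%s\n" % (rec1[0], rec1[1], rec2[1]))

for f in (in1, in2, out):
    f.close()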

That way you should need very little loaded in memory (in theory, just the open file handles and the contents of the current lines).

answered 2012-07-16T16:49:59.593

I haven't tried this myself, but it seems the following should work:

file1 = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
file3 = open(sys.argv[3],"w")

for line1 in file1:
    if '@' in line1:  # line1.startswith('@') is probably better here
        a=line1.lstrip('@').rstrip('\n')
        b=file1.next().rstrip('\n')
        for line2 in file2:
            if '@' in line2:
                c=file2.next().rstrip('\n')
                break
        file3.write(">%s\n%s:%s\n"%(a,b,c))

file1.close()
file2.close()
file3.close()

In this case, only one line from each file is kept in memory at a time ... which should be fine unless the lines are really long ;^).
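
If every record really is exactly four lines (header, sequence, '+', quality), as the sample in the question suggests, another way to keep memory flat is to read blocks of four lines instead of searching for '@'. That is an assumption about the data, not something the question states. The helper name records is hypothetical, and the sketch is written for Python 3.

```python
import io
from itertools import islice

def records(fileobj):
    """Yield (header, sequence) from consecutive four-line blocks."""
    while True:
        block = list(islice(fileobj, 4))  # at most four lines in memory
        if len(block) < 4:  # end of file (or a truncated final record)
            return
        header, sequence = block[0], block[1]
        yield header.lstrip('@').rstrip('\n'), sequence.rstrip('\n')

# In-memory demo; the first quality line deliberately starts with '@'.
f1 = io.StringIO("@r1\nACGT\n+\n@AAA\n@r2\nGGCC\n+\nAAAA\n")
parsed = list(records(f1))
print(parsed)  # [('r1', 'ACGT'), ('r2', 'GGCC')]
```

Because position decides what each line is, a quality line that happens to begin with '@' is not mistaken for a header.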

Also, since you're calling lstrip on the '@' character, you may want to use if line.startswith('@') instead of if '@' in line.
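
To illustrate the difference: '@' in line matches an '@' anywhere in the line, while startswith('@') matches only at the beginning. The sketch below uses made-up sample lines (not the question's data), on the assumption that a quality line may contain '@' as one of its characters.

```python
# Hypothetical lines: a header, a sequence, a separator, and a quality
# string that happens to contain '@' in the middle.
lines = ["@read1\n", "ACGT\n", "+\n", "II@I\n"]

contains_at = [line for line in lines if '@' in line]
starts_with_at = [line for line in lines if line.startswith('@')]

print(contains_at)     # ['@read1\n', 'II@I\n']
print(starts_with_at)  # ['@read1\n']
```

A quality line that itself begins with '@' would still fool startswith, so this narrows the problem rather than eliminating it.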

answered 2012-07-16T16:51:42.013

Here's my [second, more compact] effort:

import sys
import itertools

def reader(fileobj, yield_at_line=False):
    for line in fileobj:
        if line.startswith('@'):
            at_line = line.lstrip('@').rstrip('\n')
            next_line = fileobj.next().rstrip('\n')
            yield (at_line, next_line) if yield_at_line else next_line

with open(sys.argv[1]) as file1, open(sys.argv[2]) as file2, open(sys.argv[3], "w") as file3:
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    for (a,b), c in itertools.izip(first, second):
        file3.write('>{}\n{}:{}\n'.format(a, b, c))

This gives

~/coding$ cat file1
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

~/coding$ cat file2
@Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

~/coding$ python simulwork.py file1 file2 file3
~/coding$ cat file3
>Read.Salmonella_paratyphi_A_chromosome.29004.4835/1
TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACKIPPTCGTAG:TCGTGTACAGCATTCTTTATAGTGGAACGGTGACCGTACCGCAAAGCTGCGAAATCAACGCCGGACAAACGATTCT
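
The script above targets Python 2 (fileobj.next() and itertools.izip). For anyone on Python 3, a rough port might look like the sketch below: the built-in next() replaces the .next() method, and the built-in zip is already lazy. The function name merge and the io.StringIO demo inputs are additions for illustration only.

```python
import io

def reader(fileobj, yield_at_line=False):
    """Yield the line after each '@' header, optionally with the header text."""
    for line in fileobj:
        if line.startswith('@'):
            at_line = line.lstrip('@').rstrip('\n')
            # The '' default avoids StopIteration if the file ends on a header.
            next_line = next(fileobj, '').rstrip('\n')
            yield (at_line, next_line) if yield_at_line else next_line

def merge(file1, file2, out):
    """Stream records from both inputs into `out` without building lists."""
    first = reader(file1, yield_at_line=True)
    second = reader(file2)
    for (a, b), c in zip(first, second):  # zip is lazy in Python 3
        out.write('>{}\n{}:{}\n'.format(a, b, c))

# In-memory demo; with real data, pass files opened via open(...) instead.
f1 = io.StringIO("@r1\nACGT\n+\nAAAA\n")
f2 = io.StringIO("@r1\nTTGG\n+\nAAAA\n")
out = io.StringIO()
merge(f1, f2, out)
```

Using next(fileobj, '') instead of a bare next(fileobj) matters in Python 3.7 and later, where a StopIteration escaping from inside a generator is turned into a RuntimeError.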
answered 2012-07-16T16:57:07.373