python - Processing speed - editing a large 2GB text file in Python

Question

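So I have a problem. I'm working with .txt files whose line counts are multiples of 4. I'm using Python 3.

I wrote code that is meant to take lines 2 and 4 of the text file and keep only the first 20 characters of those two lines (leaving lines 1 and 3 unedited), then create a new file containing the edited lines 2 and 4 and the unedited lines 1 and 3. The same pattern applies to every group of four lines, since the line count of every text file I work with is always a multiple of 4.

This works on small files (about 100 lines in total), but the files I need to edit have over 50 million lines, and it takes more than 4 hours.

Below is my code. Can anyone give me a suggestion on how to speed up my program? Thanks!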

import io
import os
import sys

newData = ""
i=0
run=0
j=0
k=1
m=2
n=3
seqFile = open('temp100.txt', 'r')
seqData = seqFile.readlines()
while i < 14371315:
    sLine1 = seqData[j] 
    editLine2 = seqData[k]
    sLine3 = seqData[m]
    editLine4 = seqData[n]
    tempLine1 = editLine2[0:20]
    tempLine2 = editLine4[0:20]
    newLine1 = editLine2.replace(editLine2, tempLine1)
    newLine2 = editLine4.replace(editLine4, tempLine2)
    newData = newData + sLine1 + newLine1 + '\n' + sLine3 + newLine2
    if len(seqData[k]) > 20:
         newData += '\n'
    i=i+1
    run=run+1
    j=j+4
    k=k+4
    m=m+4
    n=n+4
    print(run)

seqFile.close()

new = open("new_100temp.txt", "w")
sys.stdout = new
print(newData)

score 2 · Accepted Answer

The biggest problem here seems to be reading the entire file at once:

seqData = seqFile.readlines()

Instead, you should first open the source file and the output file. Then iterate over the first file and manipulate the lines as needed:

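outfile = open('output.txt', 'w')
infile = open('input.txt', 'r')

i = 0
for line in infile:
    if i % 2 == 0:
        # Lines 1 and 3 of each group (even 0-based index): copy unchanged.
        newline = line
    else:
        # Lines 2 and 4: keep the first 20 characters. Slicing drops the
        # newline from longer lines, so strip it first and add it back.
        newline = line.rstrip('\n')[:20] + '\n'

    outfile.write(newline)
    i += 1

outfile.close()
infile.close()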

score 2 · Accepted Answer

You are holding both files (input and output) in memory. If the files are large, this can cause timing problems (paging). Try this (pseudocode):

Open input file for read
Open output file for write
Initialize counter to 1
While not EOF in input file
    Read input line
    If counter is odd 
        Write line to output file
    Else
        Write 20 first characters of line to output file
    Increment counter
Close files
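
The pseudocode above might translate to Python roughly as follows. This is only a minimal sketch: the function name `truncate_even_lines` and the `width` parameter are illustrative and do not come from the original post, and it assumes every line ends with '\n'.

```python
def truncate_even_lines(src_path, dst_path, width=20):
    """Copy src_path to dst_path, truncating every even-numbered line.

    Hypothetical helper: the name and the `width` default are assumptions.
    """
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        counter = 1  # 1-based line counter, as in the pseudocode
        for line in src:  # reads one line at a time and stops at EOF
            if counter % 2 == 1:
                dst.write(line)  # odd line: write unchanged
            else:
                # even line: keep the first `width` characters and
                # terminate the line with a single newline
                dst.write(line.rstrip("\n")[:width] + "\n")
            counter += 1
```

Because only one line is held at a time, memory use should stay roughly constant regardless of file size. Note that a final line without a trailing newline would gain one if it falls on an even line number.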

score 2 · Accepted Answer

It will probably be much faster if you read only 4 lines at a time and process them (untested):

with open('100temp.txt') as in_file, open('new_100temp.txt', 'w') as out_file:
    for line1, line2, line3, line4 in grouper(in_file, 4):
        # modify 4 lines
        out_file.writelines([line1, line2, line3, line4])

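where grouper(it, n) is a function that yields n items of the iterable it at a time. It is given as one of the recipes in the itertools module documentation (see also this SO answer). Iterating over the file this way is similar to calling readlines() on the file and then iterating over the resulting list manually, but it only reads a few lines into memory at a time.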
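
Since grouper is not defined in the snippet above, here is one way the pieces might fit together. The grouper function follows the well-known itertools recipe built on `zip_longest`; the helpers `trim` and `trim_records` are hypothetical names introduced for illustration. The sketch assumes the number of lines is a multiple of 4, as stated in the question; otherwise the last group would be padded with None and the write would fail.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks (itertools recipe)."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)


def trim(line, width=20):
    """Hypothetical helper: first `width` characters plus one newline."""
    return line.rstrip("\n")[:width] + "\n"


def trim_records(in_path, out_path, width=20):
    """Hypothetical helper: process the file four lines at a time."""
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for line1, line2, line3, line4 in grouper(in_file, 4):
            # lines 1 and 3 pass through; lines 2 and 4 are truncated
            out_file.writelines(
                [line1, trim(line2, width), line3, trim(line4, width)]
            )
```

Unpacking each group into four names keeps the "which line is which" logic explicit, at the cost of relying on the multiple-of-4 assumption.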

score 1 · Accepted Answer

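See the documentation for the best way to read files. Don't keep everything in memory, which is what you are doing with seqData = seqFile.readlines(); just iterate over the file. Python takes care of buffering and so on for you, so it is fast and efficient. Also, you shouldn't open and close the files yourself (as the other answers do); use the with keyword.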

lineCount = 0
with open("new_100temp.txt", "w") as newFile, open("100temp.txt", "r") as oldFile:
    for line in oldFile:
        # start on line 1, keep 1st and 3rd as is, modify 2nd and 4th
        lineCount += 1
        if lineCount % 4 == 1 or lineCount % 4 == 3:
            newFile.write(line)
        else:
            # strip the newline before slicing so short lines don't get two
            newFile.write(line.rstrip("\n")[:20] + "\n")
        # printing is really slow, so only do it every 100th iteration:
        if lineCount % 100 == 0:
            print(lineCount)

I just tried it on 1 million lines of junk text and it finished in under a second. As Kevin said, simple text jobs like this are well suited to shell processing.
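
To repeat that kind of check on your own machine, something like the following could be used. It is a rough sketch, not the code the answerer ran: `make_junk_file` and `timed_edit` are hypothetical helpers, the junk lines are random ASCII letters, and the measured time will depend on disk and hardware.

```python
import random
import string
import time


def make_junk_file(path, n_lines, line_len=100, seed=0):
    """Hypothetical helper: write n_lines of random letters to path."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write("".join(rng.choices(string.ascii_letters, k=line_len)) + "\n")


def timed_edit(src_path, dst_path, width=20):
    """Hypothetical helper: stream-edit the file, return (lines, seconds)."""
    start = time.perf_counter()
    n = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for n, line in enumerate(src, start=1):
            if n % 2 == 1:
                dst.write(line)  # lines 1 and 3 of each group
            else:
                dst.write(line.rstrip("\n")[:width] + "\n")  # lines 2 and 4
    return n, time.perf_counter() - start
```

For example, calling make_junk_file("junk.txt", 1000000) and then timed_edit("junk.txt", "junk_out.txt") gives the number of lines processed and the elapsed seconds for a file of about the size mentioned above.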
