python - Improving the time to find and delete lines in a 3 GB file with Python

Question

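First post, so have at it...

My problem: I have a very large file of 140 million lines (file 1) and a somewhat smaller file of 3 million lines (file 2). I want to remove the lines in file 1 that match file 2. Intuitively this seems like a simple find-and-delete problem that shouldn't take that long, relatively speaking. As it stands, my code takes about 4 days to run on a 24 GB machine. I want to do this for several files, so I'd like to improve the running time. Any help and comments would be much appreciated.

Example file 1: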

reftig_0 43 0 1.0
reftig_0 44 1 1.0
reftig_0 45 0 1.0
reftig_0 46 1 1.0
reftig_0 47 0 5.0

Example file 2:

reftig_0 43
reftig_0 44
reftig_0 45

Code:

data = open('file_1', 'r')
data_2 = open('file_2', 'r')
new_file = open('new_file_1', 'w')

d2= {}
for line in data_2:
    line= line.rstrip()
    fields = line.split(' ')
    key = (fields[0], fields[1])
    d2[key]=1

#print d2.keys()
#print d2['reftig_1']
tocheck=d2.keys()
tocheck.sort()
#print tocheck

for sline in data:
    sline = sline.rstrip()
    fields = sline.split(' ')
    nkey = (fields[0],fields[1])
    #print nkey
    if nkey in tocheck:
        pass
    else:
        new_file.write(sline + '\n')
        #print sline

score 4 · Accepted Answer

This might be better done with grep:

grep -Fvf file2 file1
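
For readers unfamiliar with the flags: -f file2 reads the patterns from file2, -F treats them as fixed strings rather than regular expressions, and -v prints only the lines that do not match. The Python sketch below mimics that behaviour on small in-memory lists, purely as an illustration; grep_fv is a hypothetical helper name and not part of any library. It also shows a caveat worth checking against your data: the match is a substring match, so the pattern "reftig_0 43" also removes a line beginning with "reftig_0 430".

```python
def grep_fv(patterns, lines):
    """Keep only the lines that contain none of the fixed-string patterns."""
    return [line for line in lines if not any(p in line for p in patterns)]


patterns = ["reftig_0 43", "reftig_0 44"]
lines = [
    "reftig_0 43 0 1.0",   # removed: exact key match
    "reftig_0 430 1 1.0",  # also removed: "reftig_0 43" is a substring
    "reftig_0 46 1 1.0",   # kept
]
kept = grep_fv(patterns, lines)
print(kept)
```

If substring matches like this are a concern, comparing on the first two fields, as the set-based answer does, avoids them.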

answered 2012-10-21T01:35:39.473

score 4 · Accepted Answer

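Your script is slow because the line if nkey in tocheck checks nkey against a list. That is very, very slow because it is a linear search (even though tocheck is sorted).

Use a set instead: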
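
To make the difference concrete, here is a small sketch that times the same membership test against a sorted list and against a set. The key count (20000), the repeat count (200) and the probe value are arbitrary choices for illustration, not figures from the question, and the absolute timings will vary by machine.

```python
import timeit

# Keys shaped like the ones built from file 2: (name, position) tuples.
keys = [("reftig_0", str(i)) for i in range(20000)]
as_list = sorted(keys)
as_set = set(keys)

# A key that is absent forces the list search to scan every element.
probe = ("reftig_0", "-1")

list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)
print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The list lookup compares the probe with each element in turn, while the set lookup hashes the probe once, so the gap widens as the number of keys grows.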

def getkey(line):
    line = line.rstrip()
    fields = line.split(' ')
    return (fields[0], fields[1])

tocheck = {getkey(line) for line in data_2}

for line in data:
    if getkey(line) not in tocheck:
        new_file.write(line)

Combine this with unutbu's write-batching and your script should run very quickly.
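
A minimal sketch of how the two ideas might fit together, written for Python 3. The function name filter_lines and its parameters are hypothetical, and it assumes every line in both files has at least two space-separated fields (a blank line would raise an IndexError in getkey).

```python
def getkey(line):
    """Return the first two space-separated fields of a line as a tuple."""
    fields = line.rstrip().split(' ')
    return (fields[0], fields[1])


def filter_lines(big_path, small_path, out_path, batch_size=1000):
    """Copy lines of big_path whose key is not in small_path to out_path.

    Returns the number of lines written.
    """
    # Build the set of keys to drop from the smaller file.
    with open(small_path) as small:
        tocheck = {getkey(line) for line in small}

    kept = 0
    with open(big_path) as big, open(out_path, 'w') as out:
        buffer = []
        for line in big:
            if getkey(line) not in tocheck:
                buffer.append(line)
                kept += 1
                # Write in batches instead of once per line.
                if len(buffer) >= batch_size:
                    out.write(''.join(buffer))
                    buffer = []
        # Write whatever is left in the final, partial batch.
        if buffer:
            out.write(''.join(buffer))
    return kept
```

With the file names from the question, it would be called as filter_lines('file_1', 'file_2', 'new_file_1'). Only the keys of the smaller file are held in memory; the large file is streamed line by line.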

score 3 · Accepted Answer

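Writing a short string to new_file once per line is slow. Reduce the number of writes by appending the content to a list, and write to new_file only when the list reaches 1000 lines.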

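N = 1000
with open('/tmp/out', 'w') as f:
    result = []
    for x in range(10**7):
        result.append('Hi\n')
        # Write the batch once it reaches N lines, then start a new one.
        if len(result) >= N:
            f.write(''.join(result))
            result = []
    # Flush any lines left over when the total is not a multiple of N.
    if result:
        f.write(''.join(result))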

Here are the results of running time test.py for various values of N:

|      N | time (sec) |
|      1 |      5.879 |
|     10 |      2.781 |
|    100 |      2.417 |
|   1000 |      2.325 |
|  10000 |      2.299 |
| 100000 |      2.309 |
