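I have a file with about 100 million lines, and I want to replace text in it with alternate text stored in a tab-delimited file. My code works, but it takes about an hour to process the first 70K lines. As I try to gradually improve my Python skills, I am wondering whether there is a faster way to do this. Thanks! The input file looks like this: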
CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518
The file with the replacement values looks like this:
WBGene00045518 21ur-5153
Here is my code:
import re
from datetime import datetime

infile1 = open('f1.txt', 'r')  # tab-delimited file: ID, replacement text
infile2 = open('f2.txt', 'r')  # large input file
outfile = open('out.txt', 'w')

startTime = datetime.now()

# Build the lookup table of ID -> replacement text.
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict[linelist[0]] = linelist[1]

# Line counts at which to report progress.
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

linecounter = 0
for line in infile2:
    line = line.rstrip('\n')
    # Try every key against every line (this is the slow part).
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print(key, value)
            line = line.replace(key, value)
    # Write each line once, whether or not anything was replaced.
    outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print(linecounter)
        print(datetime.now() - startTime)

infile1.close()
infile2.close()
outfile.close()
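
For comparison, here is a minimal sketch of one direction that might be faster: instead of testing every key against every line, find each ID in the line with a single regular expression and look it up in the dictionary. It assumes that every ID to be replaced looks like `WBGene` followed by digits, as in the sample above; the function names (`load_mapping`, `replace_ids`, `convert`) and the pattern are illustrative, not part of the code above. Unlike `str.replace`, this only swaps whole IDs, so a key that is a prefix of a longer ID is left alone.

```python
import re


def load_mapping(path):
    """Read a two-column tab-delimited file into a dict (ID -> replacement)."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 2:
                mapping[fields[0]] = fields[1]
    return mapping


# Assumption: IDs look like "WBGene" followed by digits, as in the sample.
# Adjust this pattern if the real IDs have a different shape.
ID_PATTERN = re.compile(r'WBGene\d+')


def replace_ids(line, mapping):
    """Swap every known ID in the line; unknown IDs are left unchanged."""
    return ID_PATTERN.sub(lambda m: mapping.get(m.group(0), m.group(0)), line)


def convert(mapping_path, in_path, out_path):
    """Stream the input file line by line and write the converted lines."""
    mapping = load_mapping(mapping_path)
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            # Each line keeps its own trailing newline, so write it as-is.
            dst.write(replace_ids(line, mapping))
```

With the file names from the code above, this would be called as `convert('f1.txt', 'f2.txt', 'out.txt')`. The work per line then depends on the number of IDs in that line rather than on the number of entries in the dictionary.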