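I need to read, parse, and merge two huge text files into a new output file.
A third file is also used during this parsing.
Briefly, each of the two text files has about 100 million lines and 3 columns.
First, I read both files and write the matching values for each key to the new file.
If one of the input files has no matching value for a key, 0.0 is inserted in place of its values in that row.
To make the parsing more efficient, I built another input file from the two text files: the union of their first columns (the keys), shown below.
I tested this code with small input files (10,000 lines) and it worked well. Two days ago I started running it on the huge data set, and unfortunately it is still running.
How can I reduce the running time and parse the files efficiently?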
1st_infile.txt
MARCH2_MARCH2 2.3 0.1
MARCH2_MARC2 -0.2 0
MARCH2_MARCH5 -0.3 0.3
MARCH2_MARCH6 -1.4 0
MARCH2_MARCH7 0.1 0
MARCH2_SEPT2 -1.0 0
MARCH2_SEPT4 0.8 0
2nd_infile.txt
MARCH2_MARCH2 2.2 0
MARCH2_MARCH2.1 0.2 0
MARCH2_MARCH3 -0.4 0
MARCH2_MARCH5 -0.3 0
MARCH2_MARCH6 -0.6 0
MARCH2_MARCH7 1.2 0
MARCH2_SEPT2 0.2 0
union_file.txt
MARCH2_MARCH2
MARCH2_MARCH2.1
MARCH2_MARC2
MARCH2_MARCH5
MARCH2_MARCH6
MARCH2_MARCH7
MARCH2_SEPT2
MARCH2_SEPT4
MARCH2_MARCH3
Outfile.txt
MARCH2_MARCH2 2.3 0.1 2.2 0
MARCH2_MARCH2.1 0.0 0.0 0.2 0
MARCH2_MARC2 -0.2 0 0.0 0.0
MARCH2_MARCH5 -0.3 0.3 -0.3 0
MARCH2_MARCH6 -1.4 0 -0.6 0
MARCH2_MARCH7 0.1 0 1.2 0
MARCH2_SEPT2 -1.0 0 0.2 0
MARCH2_SEPT4 0.8 0 0.0 0.0
MARCH2_MARCH3 0.0 0.0 -0.4 0
Python.py
def load(filename):
    """Return a dict mapping each key to its (value1, value2) pair."""
    ret = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            try:
                name, value1, value2 = line.split()
            except ValueError:
                # Wrong number of columns: report the line and move on
                print('Skip invalid line {}:{}: {!r}'.format(filename, lineno, line))
                continue
            ret[name] = value1, value2
    return ret


# Both inputs are held in memory as dicts keyed on the first column
a, b = load('1st_infile.txt'), load('2nd_infile.txt')
missing = ('0.0', '0.0')  # placeholder for a key absent from one input

with open('Union_file.txt') as f, open('Outfile.txt', 'w') as fout:
    for line in f:
        name = line.strip()
        fout.write('{0:<20} {1[0]:>5} {1[1]:>5} {2[0]:>5} {2[1]:>5}\n'.format(
            name,
            a.get(name, missing),
            b.get(name, missing)
        ))
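
For comparison, here is a minimal sketch of one possible variant that keeps only the first input in memory and streams the second, so the union file is not needed. It is an illustration under assumptions, not a measured fix: the names `parse`, `merge`, and `MISSING` are hypothetical, keys are assumed to be unique within each file, and the rows come out in the order of the second file followed by the leftovers of the first, not in the order of the union file. Whether it actually shortens the run on 100 million lines would have to be measured.

```python
import io

MISSING = ("0.0", "0.0")  # placeholder pair for a key absent from one input


def parse(lines):
    """Yield (name, (value1, value2)) for each well-formed three-column line."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield parts[0], (parts[1], parts[2])


def merge(first_lines, second_lines, out):
    """Write merged rows to out, holding only the first input in memory."""
    first = dict(parse(first_lines))
    for name, pair in parse(second_lines):
        # pop() removes matched keys, so the dict shrinks as the stream advances
        left = first.pop(name, MISSING)
        out.write("{} {} {} {} {}\n".format(name, *left, *pair))
    # Whatever remains appeared only in the first input
    for name, left in first.items():
        out.write("{} {} {} {} {}\n".format(name, *left, *MISSING))


# Tiny in-memory demonstration using three rows from each sample file
first_sample = io.StringIO(
    "MARCH2_MARCH2 2.3 0.1\nMARCH2_MARC2 -0.2 0\nMARCH2_MARCH5 -0.3 0.3\n"
)
second_sample = io.StringIO(
    "MARCH2_MARCH2 2.2 0\nMARCH2_MARCH3 -0.4 0\nMARCH2_MARCH5 -0.3 0\n"
)
result = io.StringIO()
merge(first_sample, second_sample, result)
merged_text = result.getvalue()
```

With real files, `merge` could be called with two open file objects for reading and one for writing, since it only iterates over lines and calls `write`.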