If you can open all 1445 output files at once, this is easy:
paths = ['abc{}.dat'.format(i) for i in range(1445)]
files = [open(path, 'w') for path in paths]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            files[linenum].write(line)
for f in files:
    f.close()
If you can fit everything into memory (it sounds like this should be around 0.5-5.0 GB of data, which may be fine on a 64-bit machine with 8 GB of RAM…), you can do it like this:
data = [[] for _ in range(1445)]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            data[linenum].append(line)
for i, contents in enumerate(data):
    with open('abc{}.dat'.format(i), 'w') as outfile:
        outfile.write(''.join(contents))
If neither of those is appropriate, you may need some kind of hybrid. For example, if you can handle 250 output files at a time, do 6 batches, and in each infile skip the lines that don't belong to the current batch, as sketched below.
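A rough sketch of that batched approach might look like the following (the 250-file batches and 6 passes are just the example numbers from above; each batch re-reads every input file and ignores lines outside its range):

for batch in range(6):
    start = batch * 250
    stop = min(start + 250, 1445)
    # open only this batch's output files
    files = [open('abc{}.dat'.format(i), 'w') for i in range(start, stop)]
    for inpath in ('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            for linenum, line in enumerate(infile):
                # keep only the lines that belong to this batch
                if start <= linenum < stop:
                    files[linenum - start].write(line)
    for f in files:
        f.close()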
If that batched solution is too slow, then at the end of each batch within each file, stash infile.tell(), and when you come back to that file again, use infile.seek() to get back there. Something like this:
seekpoints = [0 for _ in range(40000)]
for batch in range(6):
    start = batch * 250
    stop = min(start + 250, 1445)
    paths = ['abc{}.dat'.format(i) for i in range(start, stop)]
    files = [open(path, 'w') for path in paths]
    for infilenum, inpath in enumerate('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            infile.seek(seekpoints[infilenum])
            # read only this batch's worth of lines, then remember where we stopped
            for linenum in range(stop - start):
                line = infile.readline()
                if not line:
                    break
                files[linenum].write(line)
            seekpoints[infilenum] = infile.tell()
    for f in files:
        f.close()