performance - Looking for a way to speed up the file-writing part of my Python code

Question

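I have a simple script that reads a ~2 GB data file, extracts the columns I need, and writes them as columns to another file for later processing. I ran it last night and it took almost nine hours to finish. I then ran the two parts separately and found that writing the data to the new file is the bottleneck. Can anyone point out why the way I wrote it is so slow, and suggest a better approach?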

Sample of the data being read in

26980300000000  26980300000000  39  13456502685696  1543    0
26980300000001  26980300000000  38  13282082553856  1523    0.01
26980300000002  26980300000000  37  13465223692288  1544    0.03
26980300000003  26980300000000  36  13290803560448  1524    0.05
26980300000004  26980300000000  35  9514610851840   1091    0.06
26980300000005  26980300000000  34  9575657897984   1098    0.08
26980300000006  26980300000000  33  8494254129152   974     0.1
26980300000007  26980300000000  32  8520417148928   977     0.12
26980300000008  26980300000000  31  8302391459840   952     0.14
26980300000009  26980300000000  30  8232623931392   944     0.16

Code

from itertools import islice

F = r'C:\Users\mass_red.csv'

def filesave(TID, M, R):
    # Convert the three values to text
    X = str(TID)
    Y = str(M)
    Z = str(R)
    # Append one tab-separated row to the output file
    w = open(r'C:\Users\Outfiles\acc1_out3.txt', 'a')
    w.write(X)
    w.write('\t')
    w.write(Y)
    w.write('\t')
    w.write(Z)
    w.write('\n')
    w.close()
    return()

N = 47000000
f = open(F)
f.readline()            # skip the first line
nlines = islice(f, N)   # read at most N lines

for line in nlines:
    if line != '':
        line = line.strip()
        line = line.replace(',', ' ')   # treat commas as whitespace
        columns = line.split()
        tid = int(columns[1])
        m = float(columns[3])
        r = float(columns[5])
        filesave(tid, m, r)

score 2 · Accepted Answer


You are opening and closing the output file for every line. Open it once, at the start.
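
To see how much the repeated open/close costs, here is a rough sketch that writes the same rows both ways and times each. It assumes Python 3; the function names, the row count of 20000, and the temporary files are illustrative only and are not part of the original code. Actual timings depend heavily on the operating system and disk, so no figures are claimed here.

```python
import os
import tempfile
import time

def write_reopening(path, rows):
    """Append each row with its own open/close, like filesave() in the question."""
    for row in rows:
        with open(path, 'a') as w:
            w.write(row)

def write_once(path, rows):
    """Open the file a single time and write every row through that handle."""
    with open(path, 'a') as w:
        for row in rows:
            w.write(row)

def timed(func, rows):
    """Run func against a fresh temporary file; return (seconds, file contents)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        func(path, rows)
        elapsed = time.perf_counter() - start
        with open(path) as f:
            return elapsed, f.read()
    finally:
        os.remove(path)

# Illustrative rows in the same tab-separated shape as the question's output
rows = ['%d\t%s\t%s\n' % (i, i * 1.5, i * 0.01) for i in range(20000)]
slow_seconds, slow_text = timed(write_reopening, rows)
fast_seconds, fast_text = timed(write_once, rows)
print('reopen per row: %.3f s, open once: %.3f s' % (slow_seconds, fast_seconds))
```

Both functions produce identical output; only the number of open/close calls differs.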

Answered 2015-02-15T02:19:35.210

score 1 · Accepted Answer

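In modern Python, most file handling should go through the with statement. The files are opened once, in the header of the with block, and closing is automatic. Here is a generic template for line-by-line processing.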

inp = r'C:\Users\mass_red.csv'
out = r'C:\Users\Outfiles\acc1_out3.txt'
with open(inp) as fi, open(out, 'a') as fo:
    for line in fi:
        ...
        if keep:
            ...
            fo.write(whatever)
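
One possible way to fill in that template is sketched below. The function name process_lines and its keep and transform parameters are hypothetical, chosen only for illustration: keep decides whether a line is written, and transform produces the text to write.

```python
def process_lines(inp, out, keep, transform):
    """Write transform(line) to `out` for every line of `inp` where keep(line) is true.

    Returns the number of lines written. Both files are opened once and are
    closed automatically when the with block ends, even if an error occurs.
    """
    kept = 0
    with open(inp) as fi, open(out, 'a') as fo:
        for line in fi:
            line = line.rstrip('\n')
            if keep(line):
                fo.write(transform(line) + '\n')
                kept += 1
    return kept
```

For example, process_lines(inp, out, keep=lambda s: s.strip() != '', transform=str.upper) would copy every non-blank line in upper case.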

score 1 · Accepted Answer

Here is a simplified but complete version of your code:

#!/usr/bin/env python
from __future__ import print_function
from itertools import islice

nlines_limit = 47000000
with open(r'C:\Users\mass_red.csv') as input_file, \
     open(r'C:\Users\Outfiles\acc1_out3.txt', 'w') as output_file:
    next(input_file) # skip line
    for line in islice(input_file, nlines_limit):
        columns = line.split()       
        try:
            tid = int(columns[1])
            m = float(columns[3])  
            r = float(columns[5])             
        except (ValueError, IndexError):
            pass # skip invalid lines
        else:
            print(tid, m, r, sep='\t', file=output_file)

I don't see any commas in your input, so I removed line.replace(',', ' ') from the code.
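
The input file is named mass_red.csv, so if the real file does turn out to contain commas, the replacement can be kept. A minimal sketch of a parsing helper that accepts either separator follows; the name parse_line is hypothetical. Note that replacing commas with spaces collapses empty CSV fields, so this assumes every field is populated.

```python
def parse_line(line):
    """Return (tid, m, r) from one input line, or None if the line is unusable.

    Commas are treated as whitespace, so both comma-separated and
    whitespace-separated lines are accepted.
    """
    columns = line.replace(',', ' ').split()
    try:
        return int(columns[1]), float(columns[3]), float(columns[5])
    except (ValueError, IndexError):
        return None  # blank, short, or non-numeric line
```

Inside the loop, a None result can simply be skipped before writing.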
