
I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read into memory completely, so I decided to process it line by line:

import numpy as np

with open(file_name) as fp:
    for count, line in enumerate(fp):
        # parse one whitespace-separated line into a float array
        data = np.array(line.split(), dtype=float)
        # do stuff

It turns out that I spend most of my program's run time in the data = line. Is there any way to speed it up? Also, the execution speed seems much slower than what I could get from a native FORTRAN program with a formatted read (see this question; I implemented a FORTRAN string processor and used it via f2py, but the run time was only comparable to the data = line. I guess the I/O handling and the type conversions between Python and FORTRAN killed whatever I gained from FORTRAN).

Since I know the formatting, shouldn't there be a better and faster way than using split()? Something like:

data = readf(line,'(1000F20.10)')
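To make the idea concrete: since every field in such a format is exactly 20 characters wide, the line could in principle be viewed as fixed-width chunks and converted in one vectorized step. Below is a minimal sketch, assuming the line really is nothing but F20.10 fields; parse_fixed_width is just an illustrative name, and I have not benchmarked it:

import numpy as np

def parse_fixed_width(line, width=20):
    # strip the newline and view the line as fixed-width byte fields
    raw = line.rstrip("\n").encode("ascii")
    fields = np.frombuffer(raw, dtype="S%d" % width)
    # convert the byte-string fields to floats in one vectorized step
    return fields.astype(np.float64)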

I tried the fortranformat package, which worked well, but in my case it was three times slower than the split() approach.

P.S. As suggested by ExP and root, I tried np.fromstring and made this quick and dirty benchmark:

import time

t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=float)
t2 = time.time()
print((t2 - t1) / 500)
print(data.shape)
print(data[0])

0.00160977363586
(9002,)
0.0015162509

and:

t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=" ", dtype=float, count=9002)
t2 = time.time()
print((t2 - t1) / 500)
print(data.shape)
print(data[0])

0.00159792804718
(9002,)
0.0015162509

so in my case np.fromstring brings no real improvement; the two timings are essentially identical, with the difference well within measurement noise.
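For anyone who wants to repeat the comparison, the same measurement can be written more robustly with timeit. This is only a sketch: the stand-in line below is synthetic, since my real 9002-number lines come from the data file.

import timeit
import numpy as np

# synthetic stand-in for one 9002-number line of the file
line = " ".join(["0.0015162509"] * 9002)

t_split = timeit.timeit(
    lambda: np.array(line.split(), dtype=float), number=500) / 500
t_fromstring = timeit.timeit(
    lambda: np.fromstring(line, dtype=float, sep=" ", count=9002),
    number=500) / 500
print(t_split, t_fromstring)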


2 Answers


Have you tried numpy.fromstring?

np.fromstring(line, dtype=float, sep=" ")
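For example (a tiny illustrative line; your real lines would hold 9002 numbers):

import numpy as np

line = "1.5 2.25 -3.0 4.125\n"  # stand-in for one line of the file
data = np.fromstring(line, dtype=float, sep=" ")
print(data)  # -> [ 1.5    2.25  -3.     4.125]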
answered 2013-04-10T08:32:34.603

The np.genfromtxt function is the speed champion, if you can get it to match your input format.

If not, you may already be using the fastest method available. Your line-by-line split-into-array approach exactly matches the SciPy Cookbook examples.
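A rough sketch of how np.genfromtxt could be fed the file in bounded chunks (iter_chunks and the chunk size are illustrative choices, not a tested recipe):

import itertools
import numpy as np

def iter_chunks(file_name, lines_per_chunk=1000):
    # yield the file as successive float arrays so the 100 GB file
    # never has to fit in memory at once
    with open(file_name) as fp:
        while True:
            chunk = list(itertools.islice(fp, lines_per_chunk))
            if not chunk:
                break
            yield np.genfromtxt(chunk, dtype=np.float64)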

answered 2013-04-10T08:26:03.860