I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read into memory completely, so I decided to process it line by line:

import numpy as np

with open(file_name) as fp:
    for count, line in enumerate(fp):
        # parse the whitespace-separated numbers of one line
        data = np.array(line.split(), dtype=float)
        # do stuff

It turns out that most of my program's run time is spent in the data = ... line. Is there any way to speed up that line? Also, execution seems much slower than what I could get from a native FORTRAN program with formatted read (see this question). I've implemented a FORTRAN string processor and used it with f2py, but its run time was only comparable to that of the data = ... line. I guess the I/O handling and the type conversions between Python and FORTRAN killed what I gained from FORTRAN.

Since I know the formatting, shouldn't there be a better and faster way than using split()? Something like:

data = readf(line,'(1000F20.10)')
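One way to exploit the known layout is to slice the line at fixed offsets instead of searching for whitespace. This is only a sketch: it assumes every value occupies exactly 20 characters (as in F20.10) and that the line holds nothing else apart from a trailing newline. The helper name parse_fixed_width and its width parameter are made up for illustration, and whether it beats split() would have to be measured on the real data.

```python
import numpy as np

def parse_fixed_width(line, width=20):
    """Parse a line of fixed-width numeric fields into a float array.

    Hypothetical helper: assumes each field is exactly `width` characters wide.
    """
    body = line.rstrip("\r\n")
    n_fields = len(body) // width
    # float() ignores the padding spaces around each field
    values = (float(body[i * width:(i + 1) * width]) for i in range(n_fields))
    return np.fromiter(values, dtype=float, count=n_fields)

# Build a small sample line in the F20.10 layout and parse it back.
sample = "".join("%20.10f" % x for x in (1.5, -2.25, 3.0)) + "\n"
data = parse_fixed_width(sample)
```

The slicing still runs in a Python-level loop, so it may well turn out no faster than split(); it mainly shows how the fixed field width can replace the whitespace search.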

I tried the fortranformat package, which worked well, but in my case it was three times slower than the split() approach.

P.S. As suggested by ExP and root, I tried np.fromstring and made this quick and dirty benchmark:

import time

t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=float)
t2 = time.time()
print((t2 - t1) / 500)
print(data.shape)
print(data[0])


t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=' ', dtype=float, count=9002)
t2 = time.time()
print((t2 - t1) / 500)
print(data.shape)
print(data[0])

So fromstring is in fact slightly slower in my case.


2 Answers



np.fromstring(line, dtype=float, sep=" ")
answered 2013-04-10T08:32:34.603
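As a quick sanity check, the following sketch compares the text-mode np.fromstring call with the split-based approach from the question on a short, made-up sample line. It only confirms that both return the same values; it says nothing about which is faster on the real file.

```python
import numpy as np

# Made-up sample line, much shorter than the real data.
line = "1.0 2.5 -3.75 4.0"

# Text-mode parsing: a non-empty `sep` makes fromstring read ASCII numbers.
a = np.fromstring(line, dtype=float, sep=" ")

# The split-based approach from the question, for comparison.
b = np.array(line.split(), dtype=float)

print(np.array_equal(a, b))
```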


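If not, then you are probably already using the fastest method. Your approach of splitting each line into an array exactly matches the SciPy Cookbook example.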

answered 2013-04-10T08:26:03.860