I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read completely into memory, so I decided to process it line by line:
import numpy as np

fp = open(file_name)
for count, line in enumerate(fp):
    data = np.array(line.split(), dtype=np.float)
    # do stuff with data
fp.close()
It turns out that I spend most of my program's run time in the data = ... line. Are there any ways to speed that line up? Also, the execution speed seems much slower than what I could get from a native FORTRAN program with a formatted read (see this question; I implemented a FORTRAN string processor and wrapped it with f2py, but the run time was only comparable to that of the data = ... line. I guess the I/O handling and the type conversions between Python and FORTRAN ate up whatever I gained from FORTRAN).
Since I know the formatting, shouldn't there be a better and faster way than using split()? Something like:
data = readf(line,'(1000F20.10)')
I tried the fortranformat package, which worked well, but in my case was three times slower than the split() approach.
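The fortranformat attempt was along these lines (a minimal sketch, not my exact code; the '(1000F20.10)' edit descriptor is just the example format from above, not necessarily the file's real one):

import numpy as np
import fortranformat as ff

reader = ff.FortranRecordReader('(1000F20.10)')  # build the record parser once per format
values = reader.read(line)                       # parse one record into a list of Python floats
data = np.array(values, dtype=float)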
P.S. As suggested by ExP and root, I tried np.fromstring and made this quick and dirty benchmark:
t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=np.float)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]
0.00160977363586
(9002,)
0.0015162509
and:
t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=' ', dtype=np.float, count=9002)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]
0.00159792804718
(9002,)
0.0015162509
So np.fromstring is in fact slightly slower in my case.