
So I have 4,000 large gzipped text files. Because of their size, I need to sum them line by line. Ideally (I think) I'd like to open one, then loop over the other 3,999 and simply sum their values into the first one. Here is what I have so far:

import gzip

with gzip.open('foo1.asc.gz','r') as f:
    for i in xrange(6):  # Header is 6 lines
        f.next()
    line = f.readline()
    foo1=map(float, line.strip().split())
    print foo1

This returns foo1 with the values I need to sum, so the output is a comma-separated list of floats (e.g. [1.2, 6.0, 9.3...]).

So, to be clear, if I did the same for foo2 = [1.2, 6.0...], I could then sum foo1 and foo2 to get [2.4, 12.0...], overwriting foo1. I would then keep looping over every remaining file, overwriting foo1 each time. This of course needs to loop over the 4k files.

If anyone can help me with the two loops and/or the summing operation, I'd be very grateful.
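
Roughly, I picture something like the following, generalising the snippet above over all 4,000 files (the filenames here are just placeholders, and this is untested):

import gzip

filenames = ['foo%d.asc.gz' % n for n in xrange(1, 4001)]  # placeholder names

totals = None
for filename in filenames:
    with gzip.open(filename, 'r') as f:
        for i in xrange(6):           # header is 6 lines
            f.next()
        values = map(float, f.readline().strip().split())
    if totals is None:
        totals = values               # first file initialises the running sums
    else:
        totals = [a + b for (a, b) in zip(totals, values)]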

*UPDATE* Now using the following code:

foo1=[]
with gzip.open('foo1','r') as f:
    skip_header()  # consumes the 6 header lines
    for line in f:
        foo1.append([float(i) for i in line.strip().split()])


with gzip.open('foo2','r') as f:
    skip_header()  # consumes the 6 header lines
    for (i, line) in enumerate(f):
        foo1[i] = [j + float(k) for (j, k) in zip(foo1[i], line.strip().split())]

This works, but it is slow: it takes roughly 11 minutes for my inputs.


3 Answers


The normal Python way...

You aren't iterating over the lines there (only the first data line is read). Take a simple test file:

~ $ cat test.txt 
1.0
2.0
3.0
4.5
5.0
6.0

You can, however, read all the lines and then apply float to them:

>>> with open('test.txt', 'r') as f:
...      lines = f.readlines()
...      foo1=map(float, lines)
...      print foo1
... 
[1.0, 2.0, 3.0, 4.5, 5.0, 6.0]
>>> sum(foo1)
21.5

But really, you should use NumPy!

A rough solution for summing up all the files:
import numpy as np

totalsum = 0
ListofFiles = ['foo1.asc.gz', 'foo2.asc.gz']
# from the help of np.loadtxt:
# Note that `If the filename extension is .gz or .bz2, the file is first decompressed`
# see the help for that function.
for FileName in ListofFiles:
    totalsum = totalsum + np.sum(np.loadtxt(FileName, skiprows=6))
A solution for summing the elements across different files:
# use with caution, it might hog your memory
import numpy as np

ListofFiles = ['foo1.asc.gz', 'foo2.asc.gz']

# load the first file, then stack every further file as a new row
# (np.loadtxt decompresses .gz files transparently, as noted above)
arrayHolder = np.loadtxt(ListofFiles[0], skiprows=6)
for FileName in ListofFiles[1:]:
    arrayHolder = np.vstack((arrayHolder, np.loadtxt(FileName, skiprows=6)))
# see the documentation for numpy.vstack and my example below.

# now you have a huge numpy array. you can do many things with it,
# e.g. sum the ith element across files (one value per column),
# which is what you are after:
np.sum(arrayHolder, axis=0)
# if the above test.txt had an identical file named test1.txt, the output would be:
# array([ 2. ,  4. ,  6. ,  9. , 10. , 12. ])
# sum all the values within each file instead (one value per file):
np.sum(arrayHolder, axis=1)

# more extended
In [2]: a=np.array([1.0,2.0,3.0,4.5,5.0,6.0])
In [4]: b=np.array([1.0,2.0,3.0,4.5,5.0,6.0]) 
In [9]: c=np.vstack((a,b))  
In [10]: c
Out[10]:
array([[ 1. ,  2. ,  3. ,  4.5,  5. ,  6. ],
       [ 1. ,  2. ,  3. ,  4.5,  5. ,  6. ]])
In [11]: np.sum(c, axis=0)
Out[11]: array([ 2., 4., 6., 9., 10., 12.])
In [12]: np.sum(c, axis=1)
Out[12]: array([ 21.5, 21.5])

# as I said above, this could choke your memory, so do it gradually,
# don't try it on all 4000 files at once!
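
If stacking all 4,000 files would be too much for your memory, a variant (just a sketch, assuming every file has the same shape after the 6 header rows, with placeholder filenames) is to keep a single running sum instead of the big stacked array:

import numpy as np

ListofFiles = ['foo1.asc.gz', 'foo2.asc.gz']   # extend to all 4000 files

# np.loadtxt decompresses .gz transparently, as noted above
runningSum = np.loadtxt(ListofFiles[0], skiprows=6)
for FileName in ListofFiles[1:]:
    runningSum += np.loadtxt(FileName, skiprows=6)
# runningSum now holds the element-wise total over all files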

Note that, compared with the solution Pierre provides, this should run faster, since many NumPy functions are written in C and optimized. If you need to run over 4,000 files, I'd expect the pure-Python for loops to be slower...

Answered 2012-09-20T12:10:41.360

You'll probably have to keep one list in memory, the one storing the lines of your first file.

with gzip.open(...) as f:
    skip_header()
    foo1 = [[float(i) for i in line.strip().split()] for line in f]
  • Note: here, we're building the list at once, meaning that the whole content of f is loaded in memory. That can be an issue if the file is large. In that case, just do:

    foo1 = []
    with gzip.open(...) as f:
        skip_header()
        for line in f:
            foo1.append([float(i) for i in line.strip().split()])
    

Then, you could open a second file, loop over its lines and add the values to the corresponding entry of foo1:

with gzip.open(file2) as f:
    skip_header()
    for (i, line) in enumerate(f):
        foo1[i] = [j + float(k) for (j, k) in zip(foo1[i], line.strip().split())]

There shouldn't be much problem, unless you have a different number of columns in your files.

If your files are really large, memory can be an issue. In that case, you may want to work in chunks: read only a few hundred lines from the first file and store them in a list, then proceed as described, using as many lines as you read from the first file, then start over with another few hundred lines...
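
A rough sketch of that chunked variant, using itertools.islice to pull a fixed number of lines at a time (the chunk size and filenames are placeholders):

import gzip
from itertools import islice

CHUNK = 500                                   # a few hundred lines at a time, tune to taste
file1, file2 = 'foo1.asc.gz', 'foo2.asc.gz'   # placeholder names

with gzip.open(file1) as f1, gzip.open(file2) as f2:
    list(islice(f1, 6))                       # skip the 6 header lines
    list(islice(f2, 6))
    while True:
        chunk1 = list(islice(f1, CHUNK))
        chunk2 = list(islice(f2, CHUNK))
        if not chunk1:
            break
        for (line1, line2) in zip(chunk1, chunk2):
            summed = [float(a) + float(b)
                      for (a, b) in zip(line1.split(), line2.split())]
            # ... store `summed` (or write it out) before moving on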

EDIT

Given the computation times you describe in the edit, this solution is clearly suboptimal. You can't load a whole file in memory, so you'll have to work in chunks. It might be better to follow a workflow like this (a rough code sketch follows the steps):

  1. Create an empty list foo1.
  2. Open the first file, read a given chunk of lines, transform these lines into a numpy ndarray and append this array to foo1.
  3. Repeat step 2 with another chunk of lines, until you have read the whole input file.

At this point, you should have a foo1 list with as many entries as chunks you defined, each entry being a numpy array. Now

  4. Open the second file, read as many lines as you did in step #2, and transform these lines into a numpy array foo2_tmp.
  5. Add foo2_tmp to foo1[0], in place: that is, do foo1[0] += foo2_tmp. Remember, foo1[0] is your first chunk, an ndarray.
  6. Repeat step 5 for another chunk of lines, and update the corresponding entry of foo1.
  7. Repeat step 6 till you have read your whole second file.
  8. Repeat steps 4-7 for your third file, and so on.
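
Put together, a sketch of that workflow might look like this (chunk size and filenames are placeholders, and I'm assuming the same 6-line header as above):

import gzip
import numpy as np
from itertools import islice

CHUNK = 1000                                  # lines per chunk, tune to taste
files = ['foo1.asc.gz', 'foo2.asc.gz']        # placeholder names, extend to all 4000

def read_chunks(filename):
    # yield successive CHUNK-line blocks of a gzipped file as numpy arrays
    with gzip.open(filename) as f:
        list(islice(f, 6))                    # skip the 6 header lines
        while True:
            lines = list(islice(f, CHUNK))
            if not lines:
                break
            yield np.array([[float(x) for x in line.split()] for line in lines])

# steps 1-3: load the first file chunk by chunk
foo1 = list(read_chunks(files[0]))

# steps 4-8: add every other file into the stored chunks, in place
for filename in files[1:]:
    for (i, foo2_tmp) in enumerate(read_chunks(filename)):
        foo1[i] += foo2_tmp
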
Answered 2012-09-20T12:16:16.663

This is untested. Note that it's probably inefficient (and may not even be allowed) to attempt to have 4,000 file handles open at once, so the file-at-a-time approach is the most practical. The code below uses a defaultdict, which tolerates mismatched numbers of rows between files but still sums the overlapping row numbers.

import gzip
from itertools import islice
from collections import defaultdict
from glob import iglob

def sum_file(filename, dd):
    file_total = 0.0
    with gzip.open(filename) as fin:
        for lineno, line in enumerate(islice(fin, 6, None)):  # skip the 6 header lines
            row_total = sum(map(float, line.split()))
            dd[lineno] += row_total
            file_total += row_total
    return file_total


dd = defaultdict(float)
for filename in iglob('foo*.asc.gz'):
    print 'processed', filename, 'which had a total of', sum_file(filename, dd)

print 'There were', len(dd), 'rows in total'
for lineno in sorted(dd.keys()):
    print lineno, 'had a total of', dd[lineno]
Answered 2012-09-20T12:18:29.517