python - How can I efficiently load this kind of ASCII file with Python?

Question

I have large Fortran-generated ASCII files with the following format:

x y z  num_line index
1 float
2 float
...
num_line float
x2 y2 z2 num_line2 index2
1 float
2 float
...
num_line2 float
...

There can be thousands of blocks, and each block can contain hundreds of lines.

Here is an example of what I get:

0.0 0.0 0.0  4 0
1 0.5
2 0.9
3 0.4
4 0.1
0.0 0.0 1.0  4 1
1 0.2
2 0.2
3 0.4
4 0.9
0.0 1.0 2.0  5 2
1 0.7
2 0.6
3 0.9
4 0.2
5 0.7

And this is what I want (as a numpy matrix):

0.5 0.2 0.7
0.9 0.2 0.6
0.4 0.4 0.9
0.1 0.9 0.2
nan nan 0.7

Of course, I can use:

from numpy import array, zeros

my_mat = []
with open("myfile", "r") as f_in:
    niter = int(f_in.readline().split()[3])
    while niter:
        curr_vect = zeros(niter)
        for i in range(niter):
            curr_vect[i] = float(f_in.readline().split()[1])
        my_mat.append(curr_vect)
        line = f_in.readline()
        # readline() returns an empty string at end of file, never None
        if line.strip():
            niter = int(line.split()[3])
        else:
            niter = False
# blocks of unequal length do not form a regular 2-D array here
my_mat = array(my_mat)

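The problem is that this is not very efficient, and it is too complicated for what it does. I already know about numpy's loadtxt and genfromtxt, but they do not seem to apply here.

I am looking for something faster and more readable. Any ideas?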

Edit:

Please forgive me, my question was incomplete, and some of you have wasted your time because of me. Here is a real example of such a block:

3.571428571429E-02 3.571428571429E-02-3.571428571429E-02         1   35  
       1 -0.493775207966779     
       2  0.370269037864060     
       3  0.382332033744703     
       4  0.382332033744703     
       5  0.575515346181205     
       6  0.575515346181216     
       7  0.575562530624028     
       8  0.639458035564442     
       9  0.948445367602052     
      10  0.948445367602052     
      11  0.975303238888803     
      12   1.20634795229899     
      13   1.21972845646758     
      14   1.21972845646759     
      15   1.52659950368213     
      16   2.07381346028515     
      17   2.07629743909555     
      18   2.07629743909555     
      19   2.15941179949552     
      20   2.15941179949552     
      21   2.30814240005132     
      22   2.30814240005133     
      23   2.31322868361483     
      24   2.53625115348660     
      25   2.55301153157825     
      26   2.55301153157826     
      27   2.97152031842301     
      28   2.98866790318661     
      29   2.98866790318662     
      30   3.24757159459268     
      31   3.27186643004142     
      32   3.27186643004143     
      33   3.37632477135298     
      34   3.37632477135299     
      35   3.55393884607834

score 2 · Accepted Answer

You can use numpy.genfromtxt:

read a single column, delimited by the newline character \n
provide a custom converter function

Example:

import numpy as np
from io import StringIO  # Python 3; in Python 2 this was the StringIO module

# your data from above as string
raw = '''0.0 0.0 0.0  4 0
1 0.5
...
5 0.7
'''

Here is the converter:

def custom_converter(line):
    token = line.split()
    if len(token) == 2:
        return float(token[1])
    else:
        return np.nan  # header lines become NaN markers

Load the data:

data = np.genfromtxt(StringIO(raw),
                     delimiter='\n',
                     converters={0: custom_converter})

print(data)

This prints:

[ nan  0.5  0.9  0.4  0.1  nan  0.2  0.2  0.4  0.9  nan  0.7  0.6  0.9  0.2
  0.7]

Now build the final data structure:

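# positions of the NaN markers, i.e. of the header lines
delims, = np.where(np.isnan(data))
nblocks = delims.size

# append the end of the data so that the last block is measured too
bounds = np.append(delims, data.size)
# the distance between two headers is the block length plus the header itself
max_block = np.max(np.diff(bounds)) - 1
final_data = np.full((max_block, nblocks), np.nan)

low = bounds[0] + 1
for i, up in enumerate(bounds[1:]):
    final_data[0:up - low, i] = data[low:up]
    low = up + 1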

print(final_data)

which prints:

[[ 0.5  0.2  0.7]
 [ 0.9  0.2  0.6]
 [ 0.4  0.4  0.9]
 [ 0.1  0.9  0.2]
 [ nan  nan  0.7]]
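
The same two stages (collect the values block by block, then pad with NaN) can also be sketched without genfromtxt. This is only a minimal sketch under one assumption: every line with exactly two whitespace-separated fields is a data line, and any other non-empty line is a header that starts a new block. The name load_blocks is made up for this illustration.

```python
import io
import numpy as np

def load_blocks(lines):
    """Return a NaN-padded 2-D array with one column per block."""
    blocks = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            # data line: keep the second field (assumes a header came first)
            blocks[-1].append(float(fields[1]))
        elif fields:
            # any other non-empty line is treated as a header
            blocks.append([])
    # pad every block with NaN up to the length of the longest one
    out = np.full((max(map(len, blocks)), len(blocks)), np.nan)
    for j, block in enumerate(blocks):
        out[:len(block), j] = block
    return out

sample_text = """0.0 0.0 0.0  4 0
1 0.5
2 0.9
3 0.4
4 0.1
0.0 0.0 1.0  4 1
1 0.2
2 0.2
3 0.4
4 0.9
0.0 1.0 2.0  5 2
1 0.7
2 0.6
3 0.9
4 0.2
5 0.7
"""
result = load_blocks(io.StringIO(sample_text))
print(result)
```

Because it accepts any iterable of lines, the function should work the same way on an open file object. It never reads the counts in the header, so it does not depend on which header column holds the block length.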

score 1 · Accepted Answer

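import numpy as np
from itertools import groupby, zip_longest  # izip_longest in Python 2

def f1(fname):
    with open(fname) as f:
        # group consecutive lines by whether they have exactly two fields;
        # each group is turned into a list right away, because groupby
        # invalidates a group as soon as it moves on to the next one
        return np.matrix(list(zip_longest(
               *([float(x[1]) for x in v]
               for k, v in groupby(map(str.split, f),
               key=lambda x: len(x) == 2) if k),
               fillvalue=np.nan)))
f1('testfile')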

Out:

matrix([[ 0.5,  0.2,  0.7],
        [ 0.9,  0.2,  0.6],
        [ 0.4,  0.4,  0.9],
        [ 0.1,  0.9,  0.2],
        [ nan,  nan,  0.7]])

Edit:

As for performance, I tested it against @TheodrosZelleke's np.genfromtxt solution, and it appears to be about five times faster.
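
The one-liner above packs two steps together: groupby collects runs of consecutive two-field lines, and zip_longest transposes those runs while padding the short ones. Unpacked on a tiny in-memory list (the lines below are a shortened, made-up sample, not the real file), the steps look roughly like this:

```python
from itertools import groupby, zip_longest

lines = ["0.0 0.0 0.0  2 0", "1 0.5", "2 0.9",
         "0.0 0.0 1.0  1 1", "1 0.2"]

# step 1: runs of consecutive two-field lines, each materialized as a list
blocks = [[float(fields[1]) for fields in group]
          for is_data, group in groupby(map(str.split, lines),
                                        key=lambda fields: len(fields) == 2)
          if is_data]

# step 2: transpose the blocks; shorter blocks are padded with NaN
rows = list(zip_longest(*blocks, fillvalue=float("nan")))

print(blocks)
print(rows)
```

Here blocks holds one list per block, and rows holds one tuple per line position, which is the layout that np.matrix receives in f1.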

score 0 · Accepted Answer

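It would be much faster if you could make sure every block has the same number of lines (i.e. pad them with zeros). That way you could simply reshape an array read with loadtxt. Given that limitation, though, here is an example that may be faster: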

import numpy as np

data = np.loadtxt("myfile", usecols=(0, 1), unpack=True)
nx = np.sum(data[0] == 0)
ny = int(np.max(data[0]))  # loadtxt returns floats; an array shape needs an int
my_mat = np.empty((nx, ny), dtype='d')
my_mat[:] = np.nan   # if you really want to populate it with NaNs for missing
tr_ind = data[0, list(np.nonzero(np.diff(data[0]) < 0)[0]) + [-1]].astype('i')
buf = np.squeeze(data[1, np.nonzero(data[0])])

idx = 0
for i in range(nx):
    my_mat[i, :tr_ind[i]] = buf[idx : idx + tr_ind[i]]
    idx += tr_ind[i]

You can check the result:

>>> my_mat.T
array([[ 0.5,  0.2,  0.7],
       [ 0.9,  0.2,  0.6],
       [ 0.4,  0.4,  0.9],
       [ 0.1,  0.9,  0.2],
       [ nan,  nan,  0.7]])

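Update: as TheodrosZelleke pointed out, the solution above fails when x2 (the first column) is nonzero. I did not notice that the first time. Here is an update that fixes the problem: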

# this will give a conversion warning because column number varies
# the sizes are cast to int because they are used as a shape and as slice bounds
blk_sizes = np.genfromtxt("myfile", invalid_raise=False, usecols=(-2,)).astype(int)

nx = blk_sizes.size
ny = np.max(blk_sizes)

data = np.loadtxt("myfile", usecols=(1,))
my_mat = np.empty((nx, ny), dtype='d')
my_mat[:] = np.nan

idx = 1
for i in range(nx):
    my_mat[i, :blk_sizes[i]] = data[idx : idx + blk_sizes[i]]
    idx += blk_sizes[i] + 1

(Then take my_mat.T.)
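
For the equal-length case mentioned at the start of this answer, the reshape could look like the sketch below. It assumes every block has exactly n data lines; the sample text and the value n = 2 are made up for the illustration and are not taken from the real file.

```python
import io
import numpy as np

# hypothetical content in which every block has exactly n = 2 data lines
text = """0.0 0.0 0.0  2 0
1 0.5
2 0.9
0.0 0.0 1.0  2 1
1 0.2
2 0.4
"""
n = 2

# second field of every line, headers included
col = np.loadtxt(io.StringIO(text), usecols=(1,))

# each block contributes one header value followed by n data values:
# reshape to one row per block, drop the header column, then transpose
my_mat = col.reshape(-1, n + 1)[:, 1:].T
print(my_mat)
```

No Python-level loop is needed in this case, which is where the speed-up would come from.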
