python - 在 python 中读取一个巨大的 xyz 文件（200M 行，~15 GB）并对其执行计算

Question

我有一个分子模拟软件的 .xyz 文件的输出。我需要计算两组原子之间的距离。第一行是原子数（1046），下一行可以看作是我不需要的注释。然后是原子的坐标。该文件的一般视图如下。

    1046
 i =    13641, time =     5456.400, E =     -6478.1065220464
  O        -7.4658679231       -8.2711817669       -9.8539631371
  H        -7.6241360163       -9.2582538006       -9.9522290769
  H        -8.2358222851       -7.6941601822       -9.9653770757
  O        -4.9711266650       -4.7190696213      -15.2513827675
  H        -4.0601366272       -4.8452939622      -14.9451462873
  H        -5.3574156180       -5.6550789412      -15.1154558067
... ~ 1000 more lines
  O        -3.7163764338      -18.4917410571      -11.7137020838
  H        -3.3000068500      -18.5292231200      -12.6331415740
  H        -4.3493512803      -19.2443154891      -11.6751925772
    1046
 i =    13642, time =     5456.800, E =     -6478.1027892656
  O        -7.4669935102       -8.2716185134       -9.8549232159
  H        -7.6152044024       -9.2599276969       -9.9641510528
  H        -8.2364333010       -7.6943001983       -9.9565217204
  O        -4.9709831745       -4.7179801609      -15.2530422573
  H        -4.0670595153       -4.8459686871      -14.9472675802
  H        -5.3460565517       -5.6569802374      -15.1037050119
...

我要提取的第一组原子的索引类似于 0,3,6,9,...,360。考虑到文件中的前两行标题，我将其实现为

w_index = list(range(2,362,3))

另一组只有 8 个原子，我再次将其作为列表给出。

c_index = [420,488,586,688,757,847,970,1031]

我的想法是将相应的行附加到单独的列表中，以便通过函数'dist_calculate_with_pbc_correction'对它们进行操作。

def open_file(filename):

    global step

    waters = []
    carbons = []
    step=0

    with open(filename, 'r') as infile:

        for index, line in enumerate(infile):
            items = line.split()

            if index % (natoms+2) in w_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                waters.append([kind,x,y,z])

            if (index - 2)% (natoms+2) in c_index:
                kind, x, y, z = items[0], float(items[1]), float(items[2]), float(items[3])
                carbons.append([kind,x,y,z])

            if index > 0 and index % (natoms+2) == natoms:
                dist_calculate_with_pbc_correction(carbons,waters)
                carbons, waters = [], []
                step+=1

这是执行所有计算并将结果写入文件的函数。

def dist_calculate_with_pbc_correction(c,w):

    write_file(output_fn,'%s %r \n' % ('#Step:',step))

    for i in range(len(c)):
        hydration = 0
        min_d = 100
        for j in range(len(w)):

            x_diff, y_diff, z_diff = c[i][1] - w[j][1], c[i][2] - w[j][2], c[i][3] - w[j][3]

            if x_diff > pbc_a/2:
                x_diff -= pbc_a
            elif x_diff < -pbc_a/2:
                x_diff += pbc_a

            if y_diff > pbc_b/2:
                y_diff -= pbc_b
            elif y_diff < -pbc_b/2:
                y_diff += pbc_b

            if z_diff > pbc_c/2:
                z_diff -= pbc_c
            elif z_diff < -pbc_c/2:
                z_diff += pbc_c

            dist = math.sqrt(x_diff**2 + y_diff**2 + z_diff**2)
            if dist < min_d:
                min_d, min_index = dist, w_index[j]-2
            if dist < r_cutoff :
                hydration +=1

        write_file(output_fn,'%d %s %d %d \n' % (c_index[i], round(min_d,3), min_index, hydration))


def write_file(filename,out):
    with open(filename,'a') as g:

        g.write(out)

问题是，这段代码可以工作，但速度不够快（阅读速度极慢）。浏览约 200K 步（约 200M 行）的整个文件大约需要 27 分钟。我说“非常慢”，因为我的一位同事在 Fortran 中提出了她自己的版本 - 如果不是更多的话，它会做完全相同的事情 - 运行速度快 10 倍。她的代码在 3 分钟内运行整个文件，尽管她费心计算以确定所有原子之间的最小距离，这与我指定原子索引的方式不同。我知道大部分时间都花在选择“open_file”函数中的原子以将它们添加到相应的列表中，但我不知道如何改进它。任何帮助将不胜感激。

如果您想要一个方便的示例文件，这里是一个示例 xyz 碳文件制作器。

import numpy as np
def sample_xyz(steps,natoms):

    with open('sample_xyz.xyz','w') as f:
        for step in range(steps):
            f.write(str(natoms)+'\n')
            f.write('i = {} \n'.format(step))

            for foo in range(natoms):
                line = 'C {} {} {}\n'.format(np.random.random()*10, np.random.random()*10, np.random.random()*10)
                f.write(line)

python - 在 python 中读取一个巨大的 xyz 文件（200M 行，~15 GB）并对其执行计算

0 回答 0

Related

Reference