python - For loop to iterate over repeating sections of a text file in Python

Question

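I am very new to programming and Python, and I am trying to convert a DLPOLY HISTORY file into an arc file. What I need to do is extract the lattice vectors (the 3x3 array under the word timestep), the x, y and z coordinates (the three entries on the line below each element) and the charge (the fourth entry on the element's line).

Ideally, I would like to eventually be able to convert files of arbitrary size and frame length.

The two header lines and first two frames of the DLPOLY HISTORY file look like this: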

File Title
         0         3         5                  136                 1906
timestep         0         5 0 3            0.000500            0.000000
        3.5853000000        0.0000000000        0.0000000000
       -1.7926500000        3.1049600000        0.0000000000
        0.0000000000        0.0000000000        4.8950000000
Ca               1   40.078000    1.050000    0.000000
     0.000000000         0.000000000         0.000000000
O                2   15.999400   -0.950000    0.000000
     1.792650000        -1.034986100         1.140535000
H                3    1.007940    0.425000    0.000000
     1.792650000        -1.034986100         1.933525000
O                4   15.999400   -0.950000    0.000000
    -1.792650000         1.034987000        -1.140535000
H                5    1.007940    0.425000    0.000000
    -1.792650000         1.034987000        -1.933525000
timestep        10         5 0 3            0.000500            0.005000
         3.5853063513        0.0000000000        0.0000000000
        -1.7926531756        3.1049655004        0.0000000000
         0.0000000000        0.0000000000        4.8950086714
Ca               1   40.078000    1.050000    0.020485
    -0.1758475885E-01    0.1947928245E-04   -0.1192033544E-01
O                2   15.999400   -0.950000    0.051020
     1.841369991        -1.037431082         1.120698646 
H                3    1.007940    0.425000    0.416965
     1.719029690        -1.029327936         2.355541077
O                4   15.999400   -0.950000    0.045979
    -1.795057186         1.034993005        -1.093028694
H                5    1.007940    0.425000    0.373772 
    -1.754959531         1.067269072        -2.320776528

The code I have so far is:

fileList = history_file.readlines()
number_of_frames = int(fileList[1].split()[3])
number_of_lines = int(fileList[1].split()[4])
frame_length = (number_of_lines - 2) / number_of_frames
number_of_atoms = int(fileList[1].split()[2])
lines_per_atom = frame_length / number_of_atoms

for i in range(3, number_of_lines+1, frame_length):

#maths for converting lattice vectors
#print statement to write out converted lattice vectors

    for j in range(i+3, frame_length+1, lines_per_atom):
             atom_type = fileList[j].split()[0]
             atom_x = fileList[j+1].split()[0]
             atom_y = fileList[j+1].split()[1]
             atom_z = fileList[j+1].split()[2]
             charge = fileList[j].split()[3]
             print atom_type, atom_x, atom_y, atom_z, charge

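I can extract and convert the lattice vectors, so that is not a problem. However, the second for loop only executes once. It seems to treat my range end value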

frame_length+1

as incorrect, but if I change it to

 i+3+frame_length+1

I get the following error:

charge = fileList[j].split()[3]
IndexError: list index out of range

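I think this means I am iterating past the end of the array.

I am sure I am overlooking something very simple, but any help would be greatly appreciated.

I would also like to know whether there is a more efficient way to read the file, since as I understand it readlines reads the whole file into memory, and HISTORY files can easily reach several GB in size.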

score 1 · Accepted Answer

OK, we can find the problem with a fairly simple check using the example values you provided. If we run the following code

for i in range(3,1907,136):
    for j in range(i+3,137,2):
        print i,j

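it prints pairs only for i = 3, with j running from 6 to 136 in steps of 2.

This is the error you were having: the loop seems to iterate only once. However, if we change the code slightly, we can see the source of the problem. If we run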

for i in range(3,1907,136):
    print "i:", i,
    for j in range(i+3,137,2):
        print "j:", j

we get this:

i: 3 j: 6
j: 8
j: 10
j: 12
...
j: 134
j: 136
i: 139 i: 275 i: 411 i: 547 i: 683 i: 819 i: 955 i: 1091 i: 1227 i: 1363 i: 1499
 i: 1635 i: 1771

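So you can see that the inner loop (the j loop) runs the first time, and once it finishes, the outer loop (the i loop) keeps going without ever letting the inner loop run again. This is because of the way you set up range in the inner loop. On the first pass it evaluates to range(6,137,2), but on the second pass it is range(142,137,2), because i is 139 by then. The start is already past the stop, so the loop terminates before it begins.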
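The empty second range can be checked directly. This is a minimal sketch written for Python 3 (where range is lazy, hence the list() calls), using the same numbers as the test loops above; the names first and second are only for illustration.

```python
# Inner-loop ranges for the first two values of i (3 and 139).
first = list(range(3 + 3, 137, 2))     # start 6 is below stop 137
second = list(range(139 + 3, 137, 2))  # start 142 is already past stop 137

print(first[:3], first[-1], len(first))  # [6, 8, 10] 136 66
print(second)                            # []  (the loop body never runs)
```

The stop value of range is an absolute index, not a length measured from the start, so it has to grow along with i.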

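To get what you want (or what I think you want), the inner loop should look like this:

for j in range(4,frame_length,lines_per_atom):
    atom_type = fileList[j+i].split()[0]

This makes j an offset within each frame, starting past the first 4 lines (the timestep line and the three lattice-vector lines). For that to line up, i has to be the index of the frame's timestep line, so the outer range should start at 2 rather than 3.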
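A fuller sketch of this loop, in Python 3, might look like the following. It assumes that i is the index of each frame's timestep line (so the outer range starts at 2) and that every frame has the layout shown in the question. The function parse_atoms and the SAMPLE text are hypothetical: SAMPLE is the question's excerpt cut down to two atoms per frame, with the header counts adjusted to match.

```python
# Hypothetical two-atom, two-frame sample in the layout of the HISTORY excerpt.
SAMPLE = """File Title
         0         3         2                    2                   18
timestep         0         2 0 3            0.000500            0.000000
        3.5853000000        0.0000000000        0.0000000000
       -1.7926500000        3.1049600000        0.0000000000
        0.0000000000        0.0000000000        4.8950000000
Ca               1   40.078000    1.050000    0.000000
     0.000000000         0.000000000         0.000000000
O                2   15.999400   -0.950000    0.000000
     1.792650000        -1.034986100         1.140535000
timestep        10         2 0 3            0.000500            0.005000
         3.5853063513        0.0000000000        0.0000000000
        -1.7926531756        3.1049655004        0.0000000000
         0.0000000000        0.0000000000        4.8950086714
Ca               1   40.078000    1.050000    0.020485
    -0.1758475885E-01    0.1947928245E-04   -0.1192033544E-01
O                2   15.999400   -0.950000    0.051020
     1.841369991        -1.037431082         1.120698646
"""


def parse_atoms(file_list):
    """Return (type, x, y, z, charge) tuples for every atom in every frame."""
    header = file_list[1].split()
    number_of_atoms = int(header[2])
    number_of_frames = int(header[3])
    number_of_lines = int(header[4])
    # // keeps these as integers so they can be used as range() steps.
    frame_length = (number_of_lines - 2) // number_of_frames
    lines_per_atom = (frame_length - 4) // number_of_atoms

    atoms = []
    # i is the index of each frame's "timestep" line (2 for the first frame).
    for i in range(2, number_of_lines, frame_length):
        # j is an offset inside the frame; offsets 0-3 hold the timestep
        # line and the three lattice-vector lines.
        for j in range(4, frame_length, lines_per_atom):
            fields = file_list[i + j].split()
            atom_type, charge = fields[0], float(fields[3])
            x, y, z = (float(v) for v in file_list[i + j + 1].split()[:3])
            atoms.append((atom_type, x, y, z, charge))
    return atoms


atoms = parse_atoms(SAMPLE.splitlines())
for atom in atoms:
    print(*atom)
```

Because the inner range counts offsets from 0 instead of absolute line numbers, it is the same for every frame and never runs past the frame it belongs to.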

But I still have not figured out how your code works at all. I calculated the values from your example by hand as a check.

frame_length = (1906 - 2) / 136 = 14
lines_per_atom = 14 / 5 = 2.8

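A lines_per_atom of 2.8 is not valid; it must be an integer. You did not get a TypeError because the / operator in Python 2 truncates integer division, so 14 / 5 silently gives 2. The calculation should be lines_per_atom = (frame_length - 4) / number_of_atoms, which skips the 4 header lines of each frame and gives (14 - 4) / 5 = 2.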
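The arithmetic can be reproduced in a few lines. This sketch assumes Python 3, where / is true division and // is integer division; the name naive_lines_per_atom is only for illustration.

```python
number_of_lines = 1906
number_of_frames = 136
number_of_atoms = 5

frame_length = (number_of_lines - 2) // number_of_frames  # 1904 // 136
naive_lines_per_atom = frame_length / number_of_atoms      # counts the 4 header lines too
lines_per_atom = (frame_length - 4) // number_of_atoms     # header lines excluded

print(frame_length, naive_lines_per_atom, lines_per_atom)  # 14 2.8 2
```

Using // also keeps the result an int, which range() requires for its step argument in Python 3.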

Anyway, hope this works!

(Also, in future try using camelCase instead of underscores, so lines_per_atom becomes linesPerAtom. In my opinion it is easier to type.)
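On the memory question raised in the post: a file object can be iterated line by line, so each frame can be processed as soon as its lines have been read, instead of loading everything with readlines(). The following is a minimal Python 3 sketch under the assumption that every frame has the layout shown in the excerpt (one timestep line, three lattice lines, then two lines per atom). The helper iter_frames and the one-atom SAMPLE text are hypothetical and not part of any DL_POLY tooling.

```python
import io

# Hypothetical one-atom, two-frame sample in the layout of the HISTORY excerpt.
SAMPLE = """File Title
         0         3         1                    2                   14
timestep         0         1 0 3            0.000500            0.000000
        3.5853000000        0.0000000000        0.0000000000
       -1.7926500000        3.1049600000        0.0000000000
        0.0000000000        0.0000000000        4.8950000000
Ca               1   40.078000    1.050000    0.000000
     0.000000000         0.000000000         0.000000000
timestep        10         1 0 3            0.000500            0.005000
         3.5853063513        0.0000000000        0.0000000000
        -1.7926531756        3.1049655004        0.0000000000
         0.0000000000        0.0000000000        4.8950086714
Ca               1   40.078000    1.050000    0.020485
    -0.1758475885E-01    0.1947928245E-04   -0.1192033544E-01
"""


def iter_frames(handle):
    """Yield (lattice, atoms) for one frame at a time, reading lazily."""
    next(handle)                                    # skip the title line
    number_of_atoms = int(next(handle).split()[2])  # third header entry
    for line in handle:
        if not line.startswith("timestep"):
            continue                                # ignore stray or blank lines
        # The three lines after "timestep" are the lattice vectors.
        lattice = [[float(v) for v in next(handle).split()] for _ in range(3)]
        atoms = []
        for _ in range(number_of_atoms):
            fields = next(handle).split()           # type, index, mass, charge, ...
            x, y, z = (float(v) for v in next(handle).split()[:3])
            atoms.append((fields[0], x, y, z, float(fields[3])))
        yield lattice, atoms


# io.StringIO stands in for an open file here.
frames = list(iter_frames(io.StringIO(SAMPLE)))
print(len(frames))
```

With a real file, the same generator would be driven by something like `with open(path) as history_file:` followed by `for lattice, atoms in iter_frames(history_file):`, so only one frame is held in memory at a time. A truncated final frame would raise an error in this sketch.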
