python - Extracting specific lines from multiple files in the same directory using Python

Question

I have multiple text files named ParticleCoordW_10000.dat, ParticleCoordW_20000.dat, and so on. Each file looks like this:

ITEM: TIMESTEP
10000
ITEM: NUMBER OF ATOMS
1000
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
ITEM: ATOMS id x y z 
673 1.03559 0.495714 0.575399 
346 2.74458 1.30048 0.0566235 
991 0.570383 0.589025 1.44128 
793 0.654365 1.33452 1.91347 
969 0.217201 0.6852 0.287291
.
. 
. 
.

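I want to use Python to extract the coordinates of a single particle, say atom ID 673. The problem is that the line on which atom ID 673 appears changes from file to file. So I would like Python to find atom 673 in every text file in the directory and save the associated x, y, z coordinates.

Previously I used something like this to get all the coordinates: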

import glob
import numpy as np

filenames = glob.glob('*.dat')
for f in filenames:
    x_data = np.loadtxt(f,usecols=[1],skiprows = 9)
    y_data = np.loadtxt(f,usecols=[2],skiprows = 9)
    z_data = np.loadtxt(f,usecols=[3],skiprows = 9)
    coord  = np.vstack((x_data,y_data,z_data)).T

Is there a way to modify this script to perform the task described above?

Edit: Based on various comments, I wrote the following:

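import glob
import natsort
import numpy as np

coord = []
filenames = natsort.natsorted(glob.glob('*.dat'))
for f in filenames:
    with open(f, 'r') as fh:
        for row in fh:
            # the trailing space keeps IDs such as 6730 from matching
            if row.startswith('673 '):
                coord.append(row.strip())
# the rows are kept as strings, so write them with a string format
np.savetxt("xyz.txt",coord,fmt='%s',delimiter=' ')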

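This lets me collect the coordinates of a single particle across all the text files in the directory. However, I would like to do the same for every particle ID (1000 particles). What is the most efficient way to do that?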

score 0 · Accepted Answer

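You can use a regular expression to pull the data out of all the files and then process it as needed. Something like this might work.

I am assuming that nothing follows the coordinate values in the files. You have to run this script from the directory that contains all the files.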

import os, re

regex = r"^ITEM: ATOMS \d+ x y z.*" # basing on this line being "ITEM: ATOMS 675 x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    if os.path.isfile(file):
        with open(file,'r') as f:
            data = f.readlines()
            matches = re.findall(regex,''.join(data),re.MULTILINE | re.DOTALL)
            temp = matches[0].split('\n')
            output[temp[0].split()[2]] = temp[1:]

This gives you a dictionary with the atom ID as the key and a list of all the coordinates as the value. Sample output:

output

{'675': ['673 1.03559 0.495714 0.575399 ',
  '346 2.74458 1.30048 0.0566235 ',
  '991 0.570383 0.589025 1.44128 ',
  '793 0.654365 1.33452 1.91347 ',
  '969 0.217201 0.6852 0.287291',
  '']}

After reviewing the question, I think I misunderstood the input. The line ITEM: ATOMS id x y z is the same in every file. So I changed the code slightly.

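import os, re

regex = r"^ITEM: ATOMS id x y z.*" # basing on this line being exactly "ITEM: ATOMS id x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    # only read the .dat data files, not this script or anything else in the directory
    if os.path.isfile(file) and file.endswith('.dat'):
        with open(file,'r') as f:
            data = f.read()
        # with DOTALL, .* runs to the end of the file: header line plus every row after it
        matches = re.findall(regex,data,re.MULTILINE | re.DOTALL)
        if not matches:
            continue # no ATOMS section in this file
        temp = matches[0].split('\n')
        output[file] = [row for row in temp[1:] if row.strip()] # storing against filename as key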
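
The dictionary above is keyed by file name and holds raw text rows, whereas the question asks for the coordinates of every particle ID across all files. The following is a minimal sketch of one way to regroup the data per atom, not necessarily what the answer's author had in mind. The function names read_atom_rows and collect_trajectories are made up for this illustration, and the sketch assumes every file follows the layout shown in the question (an "ITEM: ATOMS id x y z" header followed by "id x y z" rows).

```python
from collections import defaultdict


def read_atom_rows(path):
    """Return {atom_id: (x, y, z)} for one dump file (hypothetical helper)."""
    atoms = {}
    in_atoms = False
    with open(path) as fh:
        for line in fh:
            if line.startswith("ITEM:"):
                # coordinate rows only follow the "ITEM: ATOMS" header
                in_atoms = line.startswith("ITEM: ATOMS")
                continue
            parts = line.split()
            if in_atoms and len(parts) >= 4:
                atoms[int(parts[0])] = tuple(float(v) for v in parts[1:4])
    return atoms


def collect_trajectories(paths):
    """Return {atom_id: [(x, y, z), ...]}, one entry per file, in the order given."""
    trajectories = defaultdict(list)
    for path in paths:
        for atom_id, xyz in read_atom_rows(path).items():
            trajectories[atom_id].append(xyz)
    return dict(trajectories)
```

Called with the file list in timestep order, for example the natsorted glob result from the question, collect_trajectories(filenames)[673] would hold the positions of atom 673 in that order. Each file is read only once, regardless of how many atoms are tracked.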

score 0 · Accepted Answer

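Without more context, I cannot think of a way to find the right line without reading through the file up to the line that contains your atom ID.

You could do something like this: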

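FILE = 'ParticleCoordW_10000.dat'  # example file name from the question
ID = '673'                         # atom ID to look for, as a string

with open(FILE) as f:
    for row_number, line in enumerate(f):
        fields = line.split()
        # compare the whole first field so that e.g. 67 or 6730 do not match
        if len(fields) == 4 and fields[0] == ID:
            x, y, z = map(float, fields[1:4])  # extract the coordinates
            break  # row_number is the line on which the atom was found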

Alternatively, you could save and read back a mapping of ID <-> line number for each file.
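
As a rough illustration of such a mapping, the sketch below scans one file and records the 0-based line number of each atom's row. The function name build_line_index is hypothetical, and the sketch assumes the file layout shown in the question.

```python
def build_line_index(path):
    """Map each atom ID to the 0-based line number of its row in `path`."""
    index = {}
    in_atoms = False
    with open(path) as fh:
        for lineno, line in enumerate(fh):
            if line.startswith("ITEM:"):
                # coordinate rows only follow the "ITEM: ATOMS" header
                in_atoms = line.startswith("ITEM: ATOMS")
                continue
            parts = line.split()
            if in_atoms and len(parts) >= 4:
                index[int(parts[0])] = lineno
    return index
```

Building the index still requires one full pass over each file, so it only pays off if the same files are queried repeatedly. In that case the dictionary could be written to disk once, for instance with the json module, and reloaded later.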

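However, I think you should consider a way of saving the positions in an ordered manner. Perhaps you could also add to your question what is preventing you from saving the positions sorted by atom ID.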

I can recommend the HDF5 library for storing large datasets together with metadata.
