python - Extracting columns from a Protein Data Bank (PDB) text file

Question

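I want to make a plot with Matplotlib in Python, using data read from a PDB (Protein Data Bank) file. I would like to extract each column from the file and store the columns in separate vectors. The PDB file consists of columns containing both text and floating-point numbers. I am new to Matplotlib and have tried several ways of extracting these columns, but nothing seems to work. What is the best way to extract them? I will be loading a large amount of data at a later stage, so it would be good if the method were not too inefficient.

The PDB file looks like this: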

ATOM      1  CA  MET A   1      38.012   8.932  -1.253
ATOM      2  CA  GLU A   2      39.809   5.652  -1.702
ATOM      3  CA  ALA A   3      43.007   5.013   0.368
ATOM      4  CA  ALA A   4      41.646   7.577   2.820
ATOM      5  CA  HIS A   5      42.611   4.898   5.481
ATOM      6  CA  SER A   6      46.191   5.923   5.090
ATOM      7  CA  LYS A   7      45.664   9.815   5.134
ATOM      8  CA  SER A   8      45.898  12.022   8.181
ATOM      9  CA  THR A   9      42.528  13.075   9.570
ATOM     10  CA  GLU A  10      43.330  16.633   8.378
ATOM     11  CA  GLU A  11      44.171  15.729   4.757
ATOM     12  CA  CYS A  12      40.589  14.150   4.745
ATOM     13  CA  LEU A  13      38.984  17.314   6.105
ATOM     14  CA  ALA A  14      40.633  19.053   3.220
ATOM     15  CA  TYR A  15      39.740  16.682   0.505
ATOM     16  CA  PHE A  16      36.138  17.421   1.566
ATOM     17  CA  GLY A  17      36.536  20.854   2.826
ATOM     18  CA  VAL A  18      34.184  20.012   5.553
ATOM     19  CA  SER A  19      34.483  20.966   9.177

score 1 · Accepted Answer

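The Protein Data Bank (PDB) file format is a text file format that describes the three-dimensional structures of molecules held in the Protein Data Bank. The PDB format accordingly provides for the description and annotation of protein and nucleic acid structures, including atomic coordinates, observed side-chain rotamers, secondary structure assignments, and atomic connectivity. I found this on Google.

As for extracting the columns, you can also find the answer on Google or a wiki.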
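For readers who want something concrete, here is a minimal sketch in plain Python. It assumes the records follow the standard fixed-column PDB layout (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x/y/z in 31-38, 39-46, 47-54), which the sample in the question appears to do. The helper name `read_atom_columns` and the dictionary keys are made up for illustration.

```python
# Sample records copied from the question; with a real file,
# pass an open file object (e.g. open(path)) instead of this list.
sample_lines = [
    "ATOM      1  CA  MET A   1      38.012   8.932  -1.253",
    "ATOM      2  CA  GLU A   2      39.809   5.652  -1.702",
    "ATOM      3  CA  ALA A   3      43.007   5.013   0.368",
]


def read_atom_columns(lines):
    """Collect ATOM/HETATM fields into one list per column (hypothetical helper)."""
    keys = ("serial", "name", "res_name", "chain", "res_seq", "x", "y", "z")
    columns = {key: [] for key in keys}
    for line in lines:
        # Skip anything that is not a coordinate record (HEADER, REMARK, TER, ...).
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Slice by fixed column positions (0-based, end-exclusive).
        columns["serial"].append(int(line[6:11]))
        columns["name"].append(line[12:16].strip())
        columns["res_name"].append(line[17:20].strip())
        columns["chain"].append(line[21])
        columns["res_seq"].append(int(line[22:26]))
        columns["x"].append(float(line[30:38]))
        columns["y"].append(float(line[38:46]))
        columns["z"].append(float(line[46:54]))
    return columns


cols = read_atom_columns(sample_lines)
print(cols["res_name"], cols["x"])
```

Slicing by position rather than calling `split()` is a deliberate choice here: in real PDB files adjacent fields can run together without a separating space, so whitespace splitting is not always reliable. Each list in `cols` can be passed directly to Matplotlib plotting functions.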

score 0 · Accepted Answer

Going off @Kyle_S-C's suggestion, here is an approach using Biopython.

First, read your file into a Biopython Structure object:

import Bio.PDB
path = '/path/to/PDB/file' # your file path here
p = Bio.PDB.PDBParser()
structure = p.get_structure('myStructureName', path)

Then, for example, you can get a list containing just the atom ids like this:

ids = [a.get_id() for a in structure.get_atoms()]

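For more information, see the Biopython Structural Bioinformatics FAQ, which includes the following methods for accessing an Atom's PDB columns:

How do I extract information from an Atom object?

Use the following methods: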

# a.get_name()           # atom name (spaces stripped, e.g. 'CA')
# a.get_id()             # id (equals atom name)
# a.get_coord()          # atomic coordinates
# a.get_vector()         # atomic coordinates as Vector object
# a.get_bfactor()        # isotropic B factor
# a.get_occupancy()      # occupancy
# a.get_altloc()         # alternative location specifier
# a.get_sigatm()         # std. dev. of atomic parameters
# a.get_siguij()         # std. dev. of anisotropic B factor
# a.get_anisou()         # anisotropic B factor
# a.get_fullname()       # atom name (with spaces, e.g. '.CA.')
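Putting these methods together, the following is a rough sketch of how the columns could be gathered into separate sequences for plotting. It assumes Biopython and NumPy are installed, and it parses two records copied from the question through an in-memory `io.StringIO` handle instead of a file path; the variable names are illustrative only.

```python
import io

import numpy as np
from Bio.PDB import PDBParser

# Two records copied from the question; a real file path works the same way.
pdb_text = (
    "ATOM      1  CA  MET A   1      38.012   8.932  -1.253\n"
    "ATOM      2  CA  GLU A   2      39.809   5.652  -1.702\n"
)

# QUIET=True suppresses warnings about the occupancy and B-factor
# columns, which are absent from the sample records.
parser = PDBParser(QUIET=True)
structure = parser.get_structure("sample", io.StringIO(pdb_text))

atoms = list(structure.get_atoms())

# One list per text column.
names = [a.get_name() for a in atoms]
res_names = [a.get_parent().get_resname() for a in atoms]
# A residue id is a tuple (hetero flag, sequence number, insertion code).
res_numbers = [a.get_parent().get_id()[1] for a in atoms]

# Stack the per-atom coordinate arrays into an (n_atoms, 3) array,
# then unpack its transpose to get the x, y and z columns.
coords = np.array([a.get_coord() for a in atoms])
x, y, z = coords.T
```

The arrays `x`, `y` and `z` can then be handed to Matplotlib, for example `plt.plot(res_numbers, x)`.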

score 0 · Accepted Answer

This tutorial may help: https://py-packman.readthedocs.io/en/latest/tutorials/molecule.html#tutorials-molecule

from packman import molecule

Protein = molecule.load_structure('/path/to/PDB/file.pdb')
#molecule.download_structure('1prw','1prw.pdb') if you want to download PDB file 1prw.pdb


for i in Protein[0].get_atoms():
    #Iterating over atom objects (parent= residue)
    print(i.get_name(), i.get_id(), i.get_location(), i.get_parent().get_name())

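The code above shows how to get the atom name with i.get_name(), the atom id with i.get_id(), and so on.

All components of a PDB file can be extracted. For details, please read the PACKMAN documentation.

Disclosure: I am the author of the py-packman package.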
