python-3.x - Is there code to extract complete molecule records from a large SDF file given a list of IDs?

Question

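I am using Python 3.7 in Spyder.

I am trying to extract complete molecule records from a large SDF file, selecting them by a short list of IDs stored in a txt file, and write them to a new SDF file.

More specifically, I have a selected list of about 500 chemical molecule IDs, one ID per line (each ID is ten characters long), whose molecular details are contained in a large SDF file of about 2 GB (300,000 molecules, each record with about 400 lines between its ID and the closing $$$$ line).

I need to extract the complete records for the 500 IDs from the large 2 GB SDF file into a single SDF file for further study.

I tried similar partial Python scripts from Stack Overflow and Google, but none of them worked! Can anyone give a hint or a few lines of code to test?

Thank you, Giulio

As suggested (thanks Andrej: good idea), to simplify the problem I have made small samples of the files. In the original file every line is terminated by \n. I added position information to each record to make the results easier to follow. f1.txt contains 3 IDs; f2.sdf contains a reduced sample of the large 2 GB database; f3.sdf contains the desired output file, which holds 3 records in this example.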

f1.txt

SN00061212
SN00134795
SN00107686

f2.sdf

SN00039109
 MOLSOFT 05232012283D, 1 in the large sdf list

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D, 2 in the large sdf list,

about 400 lines more of code

$$$$
SN00134795
 MOLSOFT 05232012283D, 3 in the large sdf list

about 400 lines more of code

   $$$$
SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D, 5 in the large sdf list

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D, 7 in the large sdf list

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D, 8 in the large sdf list

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$

f3.sdf

SN00061212
 MOLSOFT 05232012283D, 4 in the large sdf list

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D, 6 in the large sdf list

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D, 9 in the large sdf list

about 400 lines more of code, one in the short txt list SN3

$$$$

score 0 · Accepted Answer

You can use the re module for this task:

If f1.txt contains:

SN00061212
SN00134795
SN00107686

and f2.sdf contains:

SN00039109
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00357061
 MOLSOFT 05232012283D

about 400 lines more of code

$$$$
SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134796
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00333333
 MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00145791
  MOLSOFT 05232012283D

about 400 lines more of code

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$

then this script:

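import re

# Read the wanted IDs, one per line, skipping blank lines.
with open('f1.txt', 'r') as f_in:
    desired_ids = set(line.strip() for line in f_in if line.strip())

# Match from a line that starts with one of the IDs up to the next $$$$ line.
expr = r'({}.*?^\s*\$\$\$\$)'.format(r'^\s*(?:' + r'|'.join(re.escape(i) for i in desired_ids) + r')')
r = re.compile(expr, flags=re.DOTALL|re.M)

# Note: f_in.read() loads the whole SDF file into memory before matching.
with open('f2.sdf', 'r') as f_in, open('f3.sdf', 'w') as f_out:
    for m in r.finditer(f_in.read()):
        print(m.group(0), file=f_out)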

produces f3.sdf:

SN00061212
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN1

  $$$$
SN00134795
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN2

  $$$$
SN00107686
 MOLSOFT 05232012283D

about 400 lines more of code, one in the short txt list SN3

$$$$
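
The script above calls f_in.read(), so the whole SDF file has to fit in memory, which may be a concern at around 2 GB. As a hedged alternative, here is a minimal sketch that reads the file line by line instead of using a regular expression. It assumes, as in the sample files, that the ID is the first line of each record and that every record ends with a line containing only $$$$ (possibly indented). The function name extract_records and its parameters are hypothetical, chosen for illustration.

```python
def extract_records(ids_path, sdf_path, out_path):
    """Copy every record whose first line is a wanted ID; return how many were copied."""
    # Read the wanted IDs, one per line, skipping blank lines.
    with open(ids_path) as f:
        wanted = {line.strip() for line in f if line.strip()}

    copied = 0
    record = []  # lines of the record currently being read
    with open(sdf_path) as f_in, open(out_path, 'w') as f_out:
        for line in f_in:
            record.append(line)
            # Assumption: a line holding only $$$$ closes the record.
            if line.strip() == '$$$$':
                # Assumption: the first line of a record is its ID.
                if record[0].strip() in wanted:
                    f_out.writelines(record)
                    copied += 1
                record = []
    return copied
```

Calling extract_records('f1.txt', 'f2.sdf', 'f3.sdf') would write the matching records in the order they appear in f2.sdf. Only one record is held in memory at a time, and the set lookup keeps the per-record check cheap even with several hundred IDs.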

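Edit:

You can see the regular expression on regex101.

re.DOTALL means that the dot character . also matches newlines. re.M (or re.MULTILINE) means that the character ^ matches at the start of every line, not only at the start of the file. More can be found in the official re documentation.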
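
To see why both flags matter, here is a small illustrative sketch on a made-up two-record string. The names text and pattern and the IDs AAA and BBB are hypothetical and not taken from the files above.

```python
import re

# Two tiny records, each closed by a $$$$ line.
text = "AAA\n data\n$$$$\nBBB\n data\n$$$$\n"

# Without re.M, ^ matches only at the very start of the string,
# so an ID that begins on a later line is not found.
no_flag = re.findall(r'^BBB', text)
with_m = re.findall(r'^BBB', text, flags=re.M)

# Without re.DOTALL, . stops at a newline, so the lazy .*? cannot
# span the lines between the ID and the $$$$ terminator.
pattern = r'^BBB.*?^\$\$\$\$'
m_only = re.findall(pattern, text, flags=re.M)
m_and_dotall = re.findall(pattern, text, flags=re.M | re.DOTALL)

print(no_flag)       # []
print(with_m)        # ['BBB']
print(m_only)        # []
print(m_and_dotall)  # ['BBB\n data\n$$$$']
```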
