python - Biopython class instance - output from Entrez.read: I don't know how to manipulate the output

Question

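I'm trying to download some xml from Pubmed - no problems there, Biopython is great. The problem is that I don't really know how to manipulate the output. I want to put most of the parsed xml into a sql database, but I'm not familiar with the output. For some things I can address the parsed xml as a dictionary, but for others it doesn't seem that straightforward.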

from Bio import Entrez
Entrez.email="xxxxxxxxxxxxx@gmail.com"
import sqlite3 as lite
handle=Entrez.efetch(db='pubmed',id='22737229', retmode='xml')
record = Entrez.read(handle)

If I want to find the title, I can do this:

title=record[0]['MedlineCitation']['Article']['ArticleTitle']

But the type of the parsed object is a class:

>>> type(record)
<class 'Bio.Entrez.Parser.ListElement'>
>>> r=record[0]
>>> type(r)
<class 'Bio.Entrez.Parser.DictionaryElement'>
>>> r.keys()
[u'MedlineCitation', u'PubmedData']

This makes me think there must be a much easier way than using it as a dictionary. But when I try:

>>> r.MedlineCitation

Traceback (most recent call last):
  File "<pyshell#67>", line 1, in <module>
    r.MedlineCitation
AttributeError: 'DictionaryElement' object has no attribute 'MedlineCitation'

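it doesn't work. I can obviously use it as a dictionary, but then I run into problems later.

The real problem is getting certain information out of the record when using it like a dictionary: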

>>> record[0]['MedlineCitation']['PMID']
StringElement('22737229', attributes={u'Version': u'1'})

This means I can't just plop (that's a technical term ;) it into my sql database, but need to convert it first:

>>> t=record[0]['MedlineCitation']['PMID']
>>> t
StringElement('22737229', attributes={u'Version': u'1'})
>>> int(t)
22737229
>>> str(t)
'22737229'

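All in all, I'm glad for the depth of information that Entrez.read() provides, but I'm not sure how to easily use that information in the resulting class instance. Usually you can do something like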

record.MedlineCitation

but it doesn't work.

Cheers

Wheaton

score 4 · Accepted Answer

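The Entrez.read() method returns a nested data structure composed of ListElements and DictionaryElements. For more information, check out the documentation for the read method in the Biopython source, which I excerpt and paraphrase below: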

def read(handle, validate=True):

This function parses an XML file created by NCBI's Entrez Utilities,
returning a multilevel data structure of Python lists and dictionaries.
...
the[se] data structure[s] seem to consist of generic Python lists,
dictionaries, strings, and so on, [but] each of these is actually a class
derived from the base type. This allows us to store the attributes
(if any) of each element in a dictionary my_element.attributes, and
the tag name in my_element.tag.

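The author of the package, Michiel de Hoon, also takes some time at the very top of the Parser.py source file to discuss his motivation for using the custom ListElements and DictionaryElements to represent XML documents in Entrez.

If you are very curious, you can also read the fascinating declarations of the ListElement, DictionaryElement, and StructureElement classes in the source code. I'll spoil the surprise and just let you know that they are very light wrappers over their basic Python data types, and that they behave almost exactly like those underlying types, except that they have a new property, attributes, which captures the XML attributes (keys and values) of each XML node in the document being parsed by read.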
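To make the "light wrapper" idea concrete, here is a minimal sketch of the pattern. The class name StringLike is hypothetical and is not Biopython's own code; it only imitates the approach of subclassing a built-in type and attaching an attributes dictionary, which is why int() and str() work on such a value.

```python
class StringLike(str):
    """Hypothetical stand-in: a str subclass that also carries XML attributes."""

    def __new__(cls, value, attributes=None):
        # str is immutable, so the text is set in __new__ rather than __init__.
        obj = super().__new__(cls, value)
        obj.attributes = dict(attributes or {})
        return obj


pmid = StringLike("22737229", attributes={"Version": "1"})

# The wrapper still behaves like a string...
same_text = pmid == "22737229"
# ...and converts to plain built-in values when needed.
as_int = int(pmid)
as_str = str(pmid)
version = pmid.attributes["Version"]
```

The extra attributes ride along with the value without changing how it compares or converts.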

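So, the basic answer to your question is that there is no "easy" way to use dot operator syntax to address the keys of a DictionaryElement. If you have a dictionary element d such that: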

>>> d
DictElement({'first_name': 'Russell', 'last_name': 'Jones'}, attributes={'occupation': 'entertainer'})

The only built-in way to read first_name is to use the normal Python dictionary API, for example:

>>> d['first_name']
'Russell'
>>> d.get('first_name')
'Russell'
>>> d.get('middle_name', 'No Middle Name')
'No Middle Name'
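If dot syntax is really wanted, one option is a thin wrapper of your own. The sketch below is an assumption of mine and not part of Biopython: AttrView is a hypothetical helper that forwards attribute access to dictionary keys, and a plain nested dict stands in for a parsed record. Because DictionaryElement derives from dict, the isinstance check should apply to it as well.

```python
class AttrView:
    """Hypothetical read-only wrapper exposing mapping keys as attributes."""

    def __init__(self, mapping):
        self._mapping = mapping

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            value = self._mapping[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so chained dot access keeps working.
        return AttrView(value) if isinstance(value, dict) else value


# Plain dict standing in for one parsed record.
record = {"MedlineCitation": {"Article": {"ArticleTitle": "An example title"}}}

view = AttrView(record)
title = view.MedlineCitation.Article.ArticleTitle
```

This only helps for keys that are valid Python identifiers, so the dictionary API remains the more general route.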

Don't lose heart, this really isn't so bad. If you want to take certain nodes and insert them into rows of a sqlite database, you can just write small functions that take a DictionaryElement as input and return something sqlite can accept as output. If you're having trouble with this, feel free to post another question specifically about that.
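As a rough sketch of such a helper: the function name article_row, the table layout, and the sample data below are all made up for illustration, and a plain dict stands in for one element of the list returned by Entrez.read(). The explicit int() and str() calls hand sqlite3 plain built-in values, so nothing depends on how it treats the wrapper classes.

```python
import sqlite3


def article_row(rec):
    """Flatten one record into a (pmid, title) tuple of plain Python values."""
    citation = rec["MedlineCitation"]
    # int() and str() return plain built-in values, dropping any wrapper class.
    return int(citation["PMID"]), str(citation["Article"]["ArticleTitle"])


# Made-up sample standing in for a parsed PubMed record.
sample = {
    "MedlineCitation": {
        "PMID": "22737229",
        "Article": {"ArticleTitle": "An example title"},
    }
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (pmid INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles VALUES (?, ?)", article_row(sample))
conn.commit()

rows = conn.execute("SELECT pmid, title FROM articles").fetchall()
```

With a real result, the same function could be applied to each element in a loop over the parsed list.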

score 1 · Accepted Answer

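I'm not sure if this is correct, but I believe "record" is a list of dictionaries. So you need to use a loop to get each dictionary.

Something like: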

for r in record:
    r['MedlineCitation']
