python - Parsing an ID from text with Python

Question

I have this text:

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

From this text, I want to parse the IDs that follow |gb| and write them to a list.

I tried using a regular expression, but without success.

score 3 · Accepted Answer

Split on the pipe character |, then skip everything up to the first gb; the next element is the ID:

from itertools import dropwhile

text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)

Demo:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'

In other words, there is no need for a regular expression.

Turn it into a generator function to get all the IDs:

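from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        try:
            # Skip fields up to and including the next 'gb'
            next(dropwhile(lambda s: s != 'gb', text))
            yield next(text)
        except StopIteration:
            # No fields left; needed on Python 3.7+ (PEP 479)
            return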

This gives:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']

Or you can use it in a simple loop:

for id in extract_ids(text):
    print(id)
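The same split-and-scan idea can also be written without an explicit iterator. The following is only a sketch: the function name `ids_after_tag` and its `tag` parameter are illustrative and not part of the answer above. It pairs each field with the one after it and keeps the successor whenever the field equals the tag.

```python
def ids_after_tag(text, tag='gb'):
    """Yield the field that follows each `tag` field in a '|'-separated header."""
    fields = text.split('|')
    # Pair every field with its successor; the last field has no successor.
    for current, following in zip(fields, fields[1:]):
        if current == tag:
            yield following


# The text from the question, split across literals for readability.
text = (
    '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] '
    '>gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk '
    '>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] '
    '>gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
)

print(list(ids_after_tag(text)))             # ['EDL26483.1', 'AAI37799.1']
print(list(ids_after_tag(text, tag='ref')))  # ['NP_001074751.1']
```

Note that this only works for tags that sit between two pipes, such as gb, ref, or sp. The gi fields are preceded by '>' rather than a pipe, so they would not match.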

score 2 · Accepted Answer

A regular expression should work:

import re
re.findall(r'gb\|([^\|]*)\|', 'gb|AB1234|')

Answered 2013-02-13T20:59:40.487

score 1 · Accepted Answer

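In this case you can do without a regular expression: just split on "|gb|", then split the second part on "|" and take the first item:

s = 'the string from the question'
r = s.split('|gb|')
r[1].split('|')[0]

Of course, you have to add a check for whether the list returned by the first split contains more or fewer than 2 items, but I think it will be faster than using a regular expression.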
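One way to cover the "more or fewer than 2 items" case is to process every chunk after the first one. This is a minimal sketch under that assumption; the helper name `ids_by_split` and its `marker` parameter are made up for illustration, and the sample string is a shortened piece of the text from the question.

```python
def ids_by_split(s, marker='|gb|'):
    # Element 0 is whatever precedes the first marker, so it never holds an ID.
    # If the marker is absent, split() returns [s] and the slice is empty.
    chunks = s.split(marker)[1:]
    # Within each remaining chunk, the ID runs up to the next pipe.
    return [chunk.split('|')[0] for chunk in chunks]


sample = ('>gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] '
          '>gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]')

print(ids_by_split(sample))                # ['EDL26483.1', 'AAI37799.1']
print(ids_by_split('no marker in here'))   # []
```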

score 1 · Accepted Answer

>>> import re
>>> match_object = re.findall(r"\|gb\|(.*?)\|", ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]")
>>> print(match_object)
['EDL26483.1', 'AAI37799.1']

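The regular expression means "match any character (.), repeated (*), but as few times as possible (?), and save only that group (the parentheses). The group must come immediately after '|gb|' and immediately before another '|'."

I used "\|" because the "|" character means alternation in a regular expression.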
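To see why the "as few as possible" qualifier matters, the lazy pattern can be compared with a greedy one and with a negated character class. This is an illustrative sketch only; the `sample` string is a shortened version of the text from the question, and the variable names are arbitrary.

```python
import re

sample = '>gi|148694536|gb|EDL26483.1| mCG128548 >gi|223460980|gb|AAI37799.1| Ibtk protein'

# Lazy: each capture stops at the first pipe after the ID.
lazy = re.findall(r'\|gb\|(.*?)\|', sample)

# Greedy: the single capture runs on to the last pipe in the string.
greedy = re.findall(r'\|gb\|(.*)\|', sample)

# A negated character class cannot cross a pipe, so it matches like the lazy form.
negated = re.findall(r'\|gb\|([^|]*)\|', sample)

print(lazy)     # ['EDL26483.1', 'AAI37799.1']
print(greedy)   # ['EDL26483.1| mCG128548 >gi|223460980|gb|AAI37799.1']
print(negated)  # ['EDL26483.1', 'AAI37799.1']
```

Raw strings (the r prefix) keep the backslashes intact for the regex engine and avoid invalid-escape warnings on recent Python versions.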

score 0 · Accepted Answer

In [1]: import re

In [2]: text = ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]"

In [3]: re.findall(r'gb\|([^\|]+)', text)[0]
Out[3]: 'EDL26483.1'

score 0 · Accepted Answer

re.findall(r'gi\|([0-9]+)\|', u'''>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]''')

works for me: [u'124486857', u'341941060', u'148694536', u'223460980']

score 0 · Accepted Answer

Assuming a is the variable that holds your string...

>>> import re
>>> a = ">gi|124486857|ref|NP_001074751.1| ..."
>>> re.findall(r"(?:\|gb\|)([a-zA-Z0-9.]+)(?:\|)", a)
['EDL26483.1', 'AAI37799.1']
