2

我有这段文字:

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

从此文本中,我想解析 |gb| 之后的 ID 并将其写在列表中。

我尝试使用正则表达式,但未能成功。

4

7 回答 7

3

在管道上拆分|,然后跳过所有内容,直到第一个gb;下一个元素是 ID:

from itertools import dropwhile

text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)

示范:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'

换句话说,不需要正则表达式。

将其变成生成器方法以获取所有 ID:

from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        next(dropwhile(lambda s: s != 'gb', text))
        yield next(text)

这给出了:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']

或者你可以在一个简单的循环中使用它:

for id in extract_ids(text):
    print id
于 2013-02-13T21:00:15.327 回答
2

正则表达式应该工作

import re
re.findall('gb\|([^\|]*)\|', 'gb|AB1234|')
于 2013-02-13T20:59:40.487 回答
1

在这种情况下,您可以不使用正则表达式,只需用“|gb|”分割,然后用“|”分割第二部分 并采取第一项:

s = 'the string from the question'
r = s.split('|gb|')
r.split('|')[0]

当然,您必须添加检查第一个拆分返回列表是否包含更多/少于 2 个项目,但我认为它会比使用正则表达式更快。

于 2013-02-13T21:03:35.520 回答
1
>>> import re
>>> match_object = re.findall("\|gb\|(.*?)\|", ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]")
>>> print match_object
['EDL26483.1', 'AAI37799.1']

正则表达式表示“匹配任何字符 (.),重复 (*),但尽可能少 (?),并且只保存该组(括号)。它们必须紧跟在 '|gb|' 之后 并紧接在另一个'|'之前。”

我用“\|” 因为“|” 字符表示正则表达式中的替代匹配。

于 2013-02-13T21:04:50.673 回答
0
In [1]: import re

In [2]: text = ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]"

In [3]: re.findall(r'gb\|([^\|]+)', text)[0]
Out[3]: 'EDL26483.1'
于 2013-02-13T21:01:38.833 回答
0

re.findall('gi\|([0-9]+)\|', u'''>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]''')为我工作: [u'124486857', u'341941060', u'148694536', u'223460980']

于 2013-02-13T21:02:36.243 回答
0

假设a是保存你的字符串的变量......

>>> import re
>>> a = ">gi|124486857|ref|NP_001074751.1| ..."
>>> re.findall(r"(?:\|gb\|)([a-zA-Z0-9.]+)(?:\|)", a)
['EDL26483.1', 'AAI37799.1']
于 2013-02-13T21:10:27.997 回答