在管道上拆分|
,然后跳过所有内容,直到第一个gb
;下一个元素是 ID:
from itertools import dropwhile
text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)
示范:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'
换句话说,不需要正则表达式。
将其变成生成器方法以获取所有 ID:
from itertools import dropwhile
def extract_ids(text):
text = iter(text.split('|'))
while True:
next(dropwhile(lambda s: s != 'gb', text))
yield next(text)
这给出了:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']
或者你可以在一个简单的循环中使用它:
for id in extract_ids(text):
print id