python - 从 python 的 fasta 标头中的相应 GI 编号获取 NCBI 的入藏号

Question

我一直在 Genbank 上看到警告说他们正在逐步淘汰 GI 编号，并保存了许多 fasta 文件，我以以下格式编辑了标题：

>SomeText_ginumber

我什至不知道从哪里开始，但有没有一种方法，最好是使用 python，我可以从 NCBI 获取每个 gi 的相应登录号，并输出一个带有如下标题的文件：

>SomeText_accessionnumber

这是文件格式的另一个示例：

>Desulfovibrio_fructosivorans_ferredoxin_492837709
MTKSVAPTVTIDGKVVPIEGERNLLELIRKVRIDLPTFCYHSELSVYGACRLCLVEVKNRGIMGAC
>Oxalobacteraceae_bacterium_AB_14_hypothetical_protein_522195384
MIKLTVNGIPVEVDEGATYLDAANKAGVHIPTLCYHPRFRSHAVCRMCLVHVAGSSRPQAACIGKA

编辑/更新：

from Bio import Entrez
from time import sleep
import sys
import re
Entrez.email = ''

seqs = open(sys.argv[1],"r")

for line in seqs:
    if line.startswith('>'):
        gi = re.findall("\d{5,}",line)
        matches = len(gi)
        #print(matches)
        if matches < 2:
            if gi:
                handle = Entrez.efetch(db="nucleotide", id=gi, retmode="xml")
                records = Entrez.read(handle)
                print(line[0:line.rfind('_') + 1] + records[0]['GBSeq_primary-accession'])
                sleep(1)
        elif matches >= 2:
             print("Error - More than one ginumber in header!")
             break
        else:
            seq=line.rstrip()
            print(seq)
    else:
        seq1=line.rstrip()
        print(seq1)

score 1 · Accepted Answer

尝试使用BioPython。

以下代码段应该可以帮助您入门。首先从标题中获取 GI（下划线之后的标题部分），从 GenBank 中获取数据，打印旧标题但带有入藏号，然后是其余的输入序列，完成:)

这适用于您的两个示例，但可能会因更多数据而失败（缺少 GI 等）。登录号也有下划线，就像你的标题一样，稍后会使解析复杂化。也许用其他东西替换下划线或添加另一个分隔符。

from Bio import Entrez
from time import sleep
Entrez.email = 'your@email'

seqs = """>Desulfovibrio_fructosivorans_ferredoxin_492837709
MTKSVAPTVTIDGKVVPIEGERNLLELIRKVRIDLPTFCYHSELSVYGACRLCLVEVKNRGIMGAC
>Oxalobacteraceae_bacterium_AB_14_hypothetical_protein_522195384
MIKLTVNGIPVEVDEGATYLDAANKAGVHIPTLCYHPRFRSHAVCRMCLVHVAGSSRPQAACIGKA"""


for line in seqs.splitlines():
    if line.startswith('>'):
        gi = line[line.rfind('_') + 1:]
        handle = Entrez.efetch(db="nucleotide", id=gi, retmode="xml")
        records = Entrez.read(handle)
        print(line[0:line.rfind('_') + 1] + records[0]['GBSeq_primary-accession'])
        sleep(1)
    else:
        print(line)

python - 从 python 的 fasta 标头中的相应 GI 编号获取 NCBI 的入藏号

1 回答 1

Related

Reference