parsing - Saving sequences from NCBI in fasta format using a list of IDs in Excel

Question

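I'm quite new to Python and I'm enjoying it. But I'm stuck on this problem, and I hope you can give me a hint about what I'm missing.

I have a list of gene IDs in an Excel file, and I'm trying to use xlrd and Biopython to retrieve the sequences and save my results (in fasta format) to a text document. So far, my code lets me see the results in the shell, but it only saves the last sequence to the document.

Here is my code: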

import xlrd
import re
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail@xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "w")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()

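As I mentioned, the file "example.txt" only contains the last sequence (in fasta format) that the shell displays.

Can anyone help me get all the sequences I retrieve from NCBI into the same document?

Thank you very much

Antonio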

score 0 · Accepted Answer

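Almost there, my friend!

The main problem is that your for loop reopens the file in "w" mode and closes it again on every iteration, so each pass overwrites what the previous one wrote. I also fixed a few small things that should speed the code up (for example, you were importing Bio on every iteration).

Use this new code: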

import xlrd
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@xxx.com"  # set once, outside the loop
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)

# Open the output file once, before the loop, so it is not overwritten per ID
out_handle = open("example.txt", "w")
for rx in range(sh.nrows):
    seq_id = sh.row(rx)[0].value  # the ID stored in the cell, not the row index
    if seq_id:
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=seq_id)
        records = SeqIO.parse(in_handle, "fasta")
        SeqIO.write(records, out_handle, "fasta")
        in_handle.close()
out_handle.close()

If you still get an error, it must be a problem with your Excel file. If the error persists, send it to me and I'll help :)
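
One Excel-related thing that may be worth checking: xlrd returns numeric cells as Python floats, so an ID typed as a number comes back looking like 186972394.0 rather than 186972394. Here is a minimal sketch of cleaning the cell values before sending them to NCBI. The helper names `cell_to_id` and `batches`, the batch size, and the sample values are all made up for illustration and are not part of xlrd or Biopython.

```python
def cell_to_id(value):
    """Turn a raw spreadsheet cell value into an ID string, or None if empty."""
    if isinstance(value, float):
        # Numeric cells arrive as floats, e.g. 186972394.0
        return str(int(value)) if value.is_integer() else str(value)
    text = str(value).strip()
    return text or None


def batches(ids, size=50):
    """Yield comma-separated ID strings holding at most `size` IDs each."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])


# Illustrative cell values: two numeric IDs, one text accession, one empty cell
raw_cells = [186972394.0, " NM_000546.6 ", "", 160418.0]
ids = [i for i in (cell_to_id(v) for v in raw_cells) if i]
print(ids)
print(list(batches(ids, size=2)))
```

Each string produced by `batches` could then be passed as the `id` argument of `Entrez.efetch`, which accepts several comma-separated IDs in one request, so you would make fewer calls than one per row.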

score 0 · Accepted Answer

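I'm also quite new to Python and love it too! This is my first attempt at answering a question, but maybe it's because of your loop structure and the "w" mode? Maybe try changing ("example.txt", "w") to append mode ("example.txt", "a"), as shown below?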

import xlrd
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@xxx.com"
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        # Read the records once so they can be both printed and written
        records = list(SeqIO.parse(in_handle, "fasta"))
        in_handle.close()
        for record in records:
            print(record.format("fasta"))
        # "a" appends to example.txt instead of overwriting it
        out_handle = open("example.txt", "a")
        SeqIO.write(records, out_handle, "fasta")
        out_handle.close()
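
To see how the file modes differ without touching NCBI at all, here is a small self-contained sketch. It uses made-up FASTA strings and temporary files in place of the real download, and the function names are only for illustration. It compares reopening with "w" on every iteration, reopening with "a" on every iteration, and opening once with "w" before the loop.

```python
import os
import tempfile

# Made-up FASTA entries standing in for the downloaded sequences
records = [">seq1\nACGT\n", ">seq2\nGGCC\n", ">seq3\nTTAA\n"]


def write_reopening(path, mode):
    """Reopen the file for every record, as the loop in the question does."""
    for rec in records:
        with open(path, mode) as handle:
            handle.write(rec)


def write_once(path):
    """Open the file a single time and write every record through one handle."""
    with open(path, "w") as handle:
        for rec in records:
            handle.write(rec)


def count_headers(path):
    """Count FASTA header lines (those starting with '>') in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


tmpdir = tempfile.mkdtemp()
w_path = os.path.join(tmpdir, "w_mode.txt")
a_path = os.path.join(tmpdir, "a_mode.txt")
once_path = os.path.join(tmpdir, "once.txt")

write_reopening(w_path, "w")   # each open truncates: only the last record survives
write_reopening(a_path, "a")   # appends: all three records are kept
write_reopening(a_path, "a")   # simulating a second run: the records are duplicated
write_once(once_path)
write_once(once_path)          # a second run simply replaces the file

print(count_headers(w_path), count_headers(a_path), count_headers(once_path))
# prints: 1 6 3
```

So append mode does keep every sequence, but running the script again adds to the old contents, whereas opening the file once in "w" mode before the loop gives a fresh, complete file on every run.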
