我正在使用这个脚本:
import csv
import time
import sys
from ete3 import NCBITaxa
ncbi = NCBITaxa()
def get_desired_ranks(taxid, desired_ranks):
lineage = ncbi.get_lineage(taxid)
names = ncbi.get_taxid_translator(lineage)
lineage2ranks = ncbi.get_rank(names)
ranks2lineage = dict((rank,taxid) for (taxid, rank) in lineage2ranks.items())
return{'{}_id'.format(rank): ranks2lineage.get(rank, '<not present>') for rank in desired_ranks}
if __name__ == '__main__':
file = open(sys.argv[1], "r")
taxids = []
contigs = []
for line in file:
line = line.split("\n")[0]
taxids.append(line.split(",")[0])
contigs.append(line.split(",")[1])
desired_ranks = ['superkingdom', 'phylum']
results = list()
for taxid in taxids:
results.append(list())
results[-1].append(str(taxid))
ranks = get_desired_ranks(taxid, desired_ranks)
for key, rank in ranks.items():
if rank != '<not present>':
results[-1].append(list(ncbi.get_taxid_translator([rank]).values())[0])
else:
results[-1].append(rank)
i = 0
for result in results:
print(contigs[i] + ','),
print(','.join(result))
i += 1
file.close()
该脚本从文件中提取样本,并从 NCBI 分类数据库的本地副本中获取它们各自的谱系。奇怪的是,当我在少量出租车(~70,~100)上运行该脚本时,它运行良好,但我的大多数数据集都超过 280k 出租车,这些破坏了脚本。
我得到这个完整的错误:
Traceback (most recent call last):
File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 31, in <module>
ranks = get_desired_ranks(taxid, desired_ranks)
File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 11, in get_desired_ranks
lineage = ncbi.get_lineage(taxid)
File "/data1/lstout/.local/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 227, in get_lineage
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
sqlite3.Warning: You can only execute one statement at a time.
回溯中的前两个文件只是我上面引用的脚本,第三个文件是 ete3 的一个。正如我所说,该脚本适用于小型数据集。
我试过的:
- 导入时间模块并在第 11 行和第 31 行我的违规代码行之前/之后休眠几毫秒/百分之一秒。没有效果。
转到ete3代码中的第227行......
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
并将“执行”功能更改为“执行脚本”,以便能够一次处理多个查询(这似乎是问题所在)。这产生了一个新的错误,并导致我在他们的脚本中更改一些小东西试图让它起作用。没有结果。这是完整的违规功能:
def get_lineage(self, taxid):
"""Given a valid taxid number, return its corresponding lineage track as a
hierarchically sorted list of parent taxids.
"""
if not taxid:
return None
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
raw_track = result.fetchone()
if not raw_track:
#perhaps is an obsolete taxid
_, merged_conversion = self._translate_merged([taxid])
if taxid in merged_conversion:
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
raw_track = result.fetchone()
# if not raise error
if not raw_track:
#raw_track = ["1"]
raise ValueError("%s taxid not found" %taxid)
else:
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
track = list(map(int, raw_track[0].split(",")))
return list(reversed(track))
令我困扰的是,这适用于少量数据!我正在我学校的高性能计算机上运行这些脚本,并尝试在他们的头节点和交互式 moab 调度程序中运行。没有任何帮助。