First, here is all the data in a list of tuples:
>>> txt='''\
... gene_name start_pos end_pos
... gene_A 100 200
... gene_B 300 400
... gene_C 500 600
... gene_D 700 800
... gene_E 900 1000'''
>>>
>>> genes=[(name, int(d1), int(d2)) for name, d1, d2 in [line.split() for line in txt.splitlines()[1:]]]
>>> genes
[('gene_A', 100, 200), ('gene_B', 300, 400), ('gene_C', 500, 600), ('gene_D', 700, 800), ('gene_E', 900, 1000)]
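Once you have that, for your simple example, you can use filter:
def query(genes, start, finish):
    # Keep genes whose (start_pos, end_pos) strictly contains both query points.
    return list(filter(lambda t: t[1]<start<t[2] and t[1]<finish<t[2], genes))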
>>> query(genes, 550, 580)
[('gene_C', 500, 600)]
>>> query(genes, 110, 180)
[('gene_A', 100, 200)]
Or a list comprehension (this version returns only the gene names):
def query(genes, start, finish):
    return [t[0] for t in genes if t[1]<start<t[2] and t[1]<finish<t[2]]
>>> query(genes, 550, 580)
['gene_C']
>>> query(genes, 110, 180)
['gene_A']
Or you can use the bisect module, which requires genes to be a sorted list.
First sort the list by start and end position:
>>> genes.sort(key=lambda t: (t[1], t[2]))
>>> genes
[('gene_A', 100, 200), ('gene_B', 300, 400), ('gene_C', 500, 600), ('gene_D', 700, 800), ('gene_E', 900, 1000)]
Generate a list of key tuples that can be used as an index:
>>> keys=[(t[1], t[2]) for t in genes]
>>> keys
[(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)]
Now you can query genes using the key index and bisect:
>>> import bisect
>>> genes[bisect.bisect_left(keys, (550, 580))-1]
('gene_C', 500, 600)
>>> genes[bisect.bisect_left(keys, (110, 180))-1]
('gene_A', 100, 200)
For more complex examples, you might consider the SortedCollection recipe.