python - 如何识别基因组中特定位置的特征

Question

我有兴趣确定基因组特定位置的特征（即基因/cds）。例如，什么基因（如果有）包含位置 2,000,000。我知道如何通过for循环和遍历基因组中的每个特征来做到这一点（代码包含在下面），但作为随机化研究的一部分，这是我想做数亿次的事情，这将需要很多时间比我想要的更长。

下面包含的代码是我正在尝试做的更具体的示例：

from Bio import SeqIO
import random
GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank")

interesting_position = random.randint(0, len(GenomeSeq))

for feature in GenomeSeq.features:  # loop each position through whole genome
    # In this particular case I'm interested in focusing on cds, but
    # in others, I may be interested in other feature types?
    if feature.type == "CDS":  
        if (feature.location._start.position <= interesting_position and 
            feature.location._end.position >= interesting_position):
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

我考虑过制作一个字典，其中基因中的每个位置对应于一个键，并将特征 ID 作为值，因为查找会比循环更快，但似乎应该有一种方法可以做GenomeSeq[interestion_position].qualifiers['gene']

score 1 · Accepted Answer

很老了，但我会试一试。假设您要检查给定基因组的随机点列表（而不是许多基因组的固定点集）：

from Bio import SeqIO
import random

GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")

# Here I generate a lot of random points in the genome in a set
interesting_positions = set(random.sample(xrange(1, len(GenomeSeq)), 20000))


for feature in GenomeSeq.features:

    if feature.type == "CDS":
        # Here I create a set for every position covered by a feature
        feature_span = set(xrange(feature.location._start.position,
                                  feature.location._end.position))

        # And this is the "trick": check if the two sets overlaps: if any point
        # is in the interesting_positions AND in the points covered by this
        # feature, we are interested in the feature.
        if feature_span & interesting_positions:
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

对于 20.000 个点和 4.7Mb 的基因组，循环在我的 crapy-2003 计算机中大约需要 3 秒，对于 200.000 个随机点需要 5 秒。

在分析和一些重构之后，我提取了所有只需要计算一次就可以得到的行：

from Bio import SeqIO
import random


def sample_genome(feature_spans, interesting_positions):
  for span in feature_spans:  
      if span & interesting_positions:
          # Do your business here
          print span[1].qualifiers.get("gene") or span[1]

if __name__ == "__main__":
    GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")
    genome_length = len(GenomeSeq)

    features = [f for f in GenomeSeq.features if f.type == "CDS"]

    spans = [(set(
        xrange(feature.location._start.position,
               feature.location._end.position)), feature)
        for feature in features]

    for i in range(20):
        positions = set(random.sample(xrange(1, genome_length), 200))
        sample_genome(spans, positions)

每个样本大约需要 0.2 秒 + 4 秒读取/准备基因组。在我的这台旧机器上完成 1.000.000 个样本大约需要 50 个小时。作为一个随机抽样，在多核机器上启动一些进程，你明天就完成了。

score 1 · Accepted Answer

我从未使用过 BioPython，但我发现它位于其文档中：http: //biopython.org/wiki/SeqIO

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
print records[0].id  #first record
print records[-1].id #last record

那是你要找的吗？

python - 如何识别基因组中特定位置的特征

2 回答 2

Related

Reference