python - biopython 对齐的无限制索引

Question

我第一次使用 biopython。如果这是一个基本问题，请原谅我。

我想输入序列，然后对齐它们，并且能够参考原始序列（未加间隙）和对齐序列（间隙）的索引位置。

我现实世界的例子是烯醇化酶（Uniprot P37869和P0A6P9）。底物结合赖氨酸在大肠杆菌中的索引为 392，在枯草芽孢杆菌中为 389。如果对两者进行肌肉对齐，则该赖氨酸在对齐中的索引为 394。我希望能够在 gappend 和 unapped indicies 之间轻松转换。

示例 1：大肠杆菌残基 #392 的对齐索引是多少？（对齐序列中的答案 394）。

示例 2我在 394 的比对中发现了一个保守的残基。它在原始（未加缺口的）序列中的什么位置？（大肠杆菌中的答案 392 和枯草芽孢杆菌中的 389）。

>>>cline = MuscleCommandline(input="enolase.txt", out="enolase.aln")
>>>cline()

>>> alignment = AlignIO.read("enolase.aln","fasta")
>>> print(alignment[:,390:])
SingleLetterAlphabet() alignment with 2 rows and 45 columns
AGQIKTGAPSRTDRVAKYNQLLRIEDQLAETAQYHGINSFYNLNK sp|P37869|ENO_BACSU
AGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYNGRKEIKGQA- sp|P0A6P9|ENO_ECOLI
>>> print(alignment[:,394])
KK

score 2 · Accepted Answer

有趣的问题！据我所知，没有内置的东西BioPython。这是我将如何解决它。

让我们从您的示例 2 开始。如果您获取两个文件enolase.txt，enolase.aln并分别使用 FASTA 格式的原始未对齐序列和对齐序列，那么我们可以遍历压缩记录，计算对齐序列中的间隙数并计算索引未封闭序列中的残基：

from Bio import SeqIO, AlignIO

def find_in_original(index, original_path, alignment_path):
    alignment = AlignIO.read(alignment_path, 'fasta')
    original = SeqIO.parse(original_path, 'fasta')

    for original_record, alignment_record in zip(original, alignment):
        alignment_seq = str(alignment_record.seq)
        original_seq = str(original_record.seq)

        gaps = alignment_seq[:index].count('-')
        original_index = len(alignment_seq[:index]) - gaps
        assert alignment_seq[index] ==  original_seq[original_index]
        yield ("The conserved residue {} at location {} in the alignment can be"
               " found at location {} in {}.".format(alignment_seq[index],
               index, original_index, original_record.id.split('|')[-1]))

结果如下：

>>> for result in  find_in_original(394, 'enolase.txt', 'enolase.aln'):
...     print result
The conserved residue K at location 394 in the alignment can be found at location 392 in ENO_ECOLI.
The conserved residue K at location 394 in the alignment can be found at location 389 in ENO_BACSU.

对于反向操作，我们查看对齐中所有可能的索引，如果我们减去间隙，看看哪一个等于未加间隙的序列：

def find_in_alignment(index, organism, original_path, alignment_path):
    alignment = AlignIO.read(alignment_path, 'fasta')
    original = SeqIO.parse(original_path, 'fasta')

    for original_record, alignment_record in zip(original, alignment):
        if organism in original_record.id:
            alignment_seq = str(alignment_record.seq)
            original_seq = str(original_record.seq)

            residue = original_seq[index]
            for i in range(index, len(alignment_seq)):
                if alignment_seq[i] == residue and \
                   alignment_seq[:i].replace('-', '') == original_seq[:index]:
                    return ("The residue {} at location {} in {} is at location"
                            " {} in the alignment.".format(residue, index,
                            organism, i))

结果如下：

>>> print find_in_alignment(392, 'ENO_ECOLI', 'enolase.txt', 'enolase.aln')
The residue K at location 392 in ENO_ECOLI is at location 394 in the alignment.
>>> print find_in_alignment(389, 'ENO_BACSU', ungapped_path, alignment_path)
The residue K at location 389 in ENO_BACSU is at location 394 in the alignment.

score 0 · Accepted Answer

def original_index_to_aln_index(aln_seq, index):
    ''' given aln seq and original 0-based index, return new index'''
    pos = index 
    n_char_before = pos - aln_seq[:pos].count('-') # how many character before position
    while n_char_before < index: # while it has not reach the number of char before
        missed_char = index - n_char_before
        pos = pos +  missed_char# if it has not reach, shift at least lacked char
        n_char_before = pos - aln_seq[:pos].count('-')
        
    
    while aln_seq[pos] == '-': # if the last character is gap, shift 1 until meet char
        pos += 1
    return pos
    ```

python - biopython 对齐的无限制索引

2 回答 2

Related

Reference