python - Calculating percent identity between two aligned sequences

Question

I have two strings in a file like this:

>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat

Using the AlignIO function in Biopython:

from Bio import AlignIO
align = AlignIO.read("neighbor.fas", "fasta")
print(align)

returns this:

SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2

I want to calculate the percent identity between the two rows in this alignment.

row = align[:,n]

allows the extraction of individual alignment columns, which can then be compared.

Columns containing only "-" should not be counted.

score 4 · Accepted Answer

This is a quick but biologically inaccurate answer.

Use the Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

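The Levenshtein Python C extension module contains functions for fast computation of Levenshtein (edit) distance and edit operations, string similarity, approximate median strings (and string averaging in general), and string sequence and set similarity. It supports both normal and Unicode strings.

Since these sequences are strings, why not!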

sudo pip install python-Levenshtein

Then fire up ipython:

In [1]: import Levenshtein

In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0

In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334
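
For comparison, here is a minimal sketch of a column-wise count that keeps the existing alignment instead of stripping the gaps and re-comparing the strings. The function name `pairwise_identity` and its `gap` parameter are made up for this illustration. It assumes that a column with a gap in only one row counts as a mismatch, which may or may not be the definition you want.

```python
def pairwise_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns, skipping columns that are gaps in both rows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # gap-only column: not counted at all
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


print(pairwise_identity("atggca---------gtgtggcaatcggcacat",
                        "atggca---------gtgtggcaatcggcacat"))
```

For the two rows in the question this prints 100.0, since the nine gap-only columns are skipped and the remaining 24 columns all match.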

score 1 · Accepted Answer

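If you want to extend this to more than two sequences, the following works well. It gives the same result as BioPerl's overall percentage identity function ( http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm ).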

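from Bio import AlignIO

def perc_identity(aln):
    """Percentage of alignment columns in which every row has the same character."""
    i = 0
    for a in range(0, len(aln[0])):
        s = aln[:, a]  # column a as a string, one character per row
        # Note: a gap-only column also counts as identical here.
        if s == len(s) * s[0]:
            i += 1
    return 100 * i / float(len(aln[0]))

# The function must be defined before it is called.
align = AlignIO.read("neighbor.fas", "fasta")
print(perc_identity(align))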
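
The function above counts gap-only columns as identical, while the question asks for them to be left out. A possible variant that skips them is sketched below. The name `perc_identity_skip_gaps` is hypothetical, and the sketch works on plain strings so it runs without Biopython.

```python
def perc_identity_skip_gaps(rows, gap="-"):
    """Percent of columns where all rows agree, ignoring columns that contain only gaps."""
    rows = [str(r) for r in rows]
    if not rows:
        raise ValueError("no sequences given")
    if any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    counted = 0
    for column in zip(*rows):  # iterate column by column
        if all(c == gap for c in column):
            continue  # gap-only column: excluded from numerator and denominator
        counted += 1
        if len(set(column)) == 1:
            identical += 1
    if counted == 0:
        raise ValueError("alignment contains only gap columns")
    return 100.0 * identical / counted


print(perc_identity_skip_gaps(["atggca---gtg", "atggca---gtg", "atggca---gta"]))
```

With a Biopython alignment, passing `[str(rec.seq) for rec in align]` should work, assuming each row is a SeqRecord.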

score 0 · Accepted Answer

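I know this question is old, but since you are already using Biopython, couldn't you go with the BLAST record class (Chapter 7 of the tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html )?

I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".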
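
As a rough sketch of how that could turn into a percentage: the helper below (a hypothetical name, `hsp_percent_identity`) divides `identities` by `align_length`, the two attributes I would expect on an HSP object from the BLAST record class. A stand-in object is used here so the snippet runs without a BLAST result file. Keep in mind that BLAST produces its own alignment rather than reusing the one already in your file.

```python
from types import SimpleNamespace


def hsp_percent_identity(hsp):
    """Percent identity of one HSP: identical positions over alignment length."""
    if hsp.align_length == 0:
        raise ValueError("HSP has zero alignment length")
    return 100.0 * hsp.identities / hsp.align_length


# Stand-in for a parsed HSP (assumed attribute names: identities, align_length).
fake_hsp = SimpleNamespace(identities=23, align_length=24)
print(hsp_percent_identity(fake_hsp))
```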
