python - 实现 Waterman-Eggert 算法

Question

我正在尝试实现 Waterman-Eggert 算法以查找次优的局部序列比对，但我正在努力理解如何“分解”单独的比对组。我有基本的 Smith-Waterman 算法工作正常。

将以下序列与自身对齐的简单测试：

'HEAGHEAGHEAG'
'HEAGHEAGHEAG'

产生如下的 fMatrix：

 [[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   8.   0.   0.   0.   8.   0.   0.   0.   8.   0.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  13.   0.   0.   0.  13.   0.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  17.   0.   0.   0.  17.   0.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  23.   0.   0.   0.  23.]
  [  0.   8.   0.   0.   0.  31.   0.   0.   0.  31.   0.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  36.   0.   0.   0.  36.   0.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  40.   0.   0.   0.  40.   0.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  46.   0.   0.   0.  46.]
  [  0.   8.   0.   0.   0.  31.   0.   0.   0.  54.   4.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  36.   0.   0.   4.  59.   9.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  40.   0.   0.   9.  63.  13.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  46.   0.   0.  13.  69.]]

为了找到次优对齐，例如

'HEAGHEAGHEAG    '
'    HEAGHEAGHEAG'

您必须首先删除最佳对齐方式（即沿主对角线）并重新计算 fMatrix；这被称为“去簇”，其中对齐的“簇”被定义为路径相交/共享一对或多对对齐残基的任何对齐。除了 fMatrix 之外，还有一个二级矩阵，其中包含有关构造 fMatrix 的方向的信息。

构建 fMatrix 和回溯矩阵的代码片段如下：

# Generates fMatrix.
for i in range(1, length):
    for j in range(1, length):
        matchScore = fMatrix[i-1][j-1] + simMatrixDict[seq[i-1]+seq[j-1]]
        insScore = fMatrix[i][j-1] + gap
        delScore = fMatrix[i-1][j] + gap
        fMatrix[i][j] = max(0, matchScore, insScore, delScore)

        # Generates matrix for backtracking.
        if fMatrix[i][j] == matchScore:
            backMatrix[i][j] = 2      
        elif fMatrix[i][j] == insScore:
            backMatrix[i][j] = 3        # INSERTION in seq - Horizontal
        elif fMatrix[i][j] == delScore:
            backMatrix[i][j] = 1        # DELETION in seq - Vertical

        if fMatrix[i][j] >= backtrackStart: 
            backtrackStart = fMatrix[i][j]
            endCoords = i, j                
return fMatrix, backMatrix, endCoords

为了消除这个最佳对齐，我尝试使用这个 backMatrix 来回溯 fMatrix（根据原始的 Smith-Waterman 算法）并随时设置fMatrix[i][j] = 0，但这不会删除整个团块，只会移除那个团块中的精确对齐.

对于一些背景信息，Smith-Waterman 算法的Wikipedia页面解释了 fMatrix 是如何构建的，并且这里有关于回溯如何工作的解释。Waterman-Eggert 算法在这里大致解释。

谢谢。

score 1 · Accepted Answer

好的。这是一些代码来做你想做的事。我使用了漂亮的打印库 ( pprint)，所以输出看起来不错。（如果矩阵中的数字是单个数字，它看起来最好，但是如果有多个数字，对齐会有点混乱。）

它是如何工作的？

因为你只需要改变主对角线上和上下对角线上的数字，我们只需要一个 for 循环。matrix[i][i]总是在主对角线上，因为它是i向下的行，i跨列的。matrix[i][i-1]总是下相邻的对角线，因为它是i向下的行，i-1跨列的。matrix[i-1][i]始终是上相邻的对角线，因为它是i-1向下的行，并且是i跨行的。

#!/usr/bin/python
import pprint

matrix = [
    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,],
    [  0,   8,   0,   0,   0,   8,   0,   0,   0,   8,   0,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  13,   0,   0,   0,  13,   0,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  17,   0,   0,   0,  17,   0,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  23,   0,   0,   0,  23,],
    [  0,   8,   0,   0,   0,  31,   0,   0,   0,  31,   0,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  36,   0,   0,   0,  36,   0,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  40,   0,   0,   0,  40,   0,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  46,   0,   0,   0,  46,],
    [  0,   8,   0,   0,   0,  31,   0,   0,   0,  54,   4,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  36,   0,   0,   4,  59,   9,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  40,   0,   0,   9,  63,  13,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  46,   0,   0,  13,  69,]]

print "Original Matrix"
pprint.pprint(matrix)
print

for i in range(len(matrix)):
    matrix[i][i] = 0
    if (i > 0) and (i < (len(matrix))):
        matrix[i][i-1] = 0
        matrix[i-1][i] = 0

print "New Matrix"
pprint.pprint(matrix)

输出：

Original Matrix
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 17, 0, 0, 0, 17, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 23, 0, 0, 0, 23],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 31, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 40, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 46],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 54, 4, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 4, 59, 9, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 9, 63, 13],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 13, 69]]

New Matrix
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 17, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 23],
 [0, 8, 0, 0, 0, 0, 0, 0, 0, 31, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 36, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 40, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 46],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 0]]

python - 实现 Waterman-Eggert 算法

1 回答 1

它是如何工作的？

输出：

Related

Reference