python - 任意字符串与定向字符的多序列比较

Question

核心问题是：

我正在寻找一种算法来计算一组字符串之间的最大简约距离。对于距离，我的意思类似于Damerau-Levenshtein 距离，即字符或相邻字符块的删除、插入、替换和转置的最少数量。但是，我想研究带有定向字符的字符串，而不是常规字符串。

因此，字符串可能如下所示：

(A,1) (B,1) (C,1) (D,1)

可能的衍生物可能是：

(A,1) (C,0) (B,0) (D,1)
(A,1) (C,1) (B,1) (D,1)
(A,1) (B,0) (C,0) (D,1)

A,B,C,D字符标识和1 = forward和在哪里0 = reverse。

在这里，导数 1. 的距离为 2，因为您可以剪切块 BC 并将其反向粘贴（1 剪切，1 粘贴）。导数 2. 也有 2，因为您可以剪下 C 并将其重新粘贴到 B 前面（1 次剪切，1 次粘贴），而数字 3. 需要 4 次操作（2 次剪切，2 次粘贴）进行转换。类似地，删除或插入块将产生距离 1。

如果您将(X,0)and定义为所有可能的 X(X,1)的两个不同的非定向字符(X0, X1)，则示例 3. 将导致距离为 2，因为您可以分两步切出块B1C1并插入块。B0C0

一个真实世界的例子：

一个细菌基因组中的基因可以被认为是定向特征(A,0),(B,0)...通过确定序列距离，两个相关细菌中同源基因的基因组方向可以作为进化标记轨迹. 细菌基因组是圆形字符串这一事实引入了额外的边界条件 ABC 等于 BCA。

真正的基因组确实具有独特的基因，在伴侣中没有同等基因，因此产生了占位符字符@。这些占位符将比较的信息内容减少到下限，因为例如 (A,1)(B,1)@(C,1) 可以转换为 (A,1)@@@(B,1) @(C,1) 通过插入块@@@。然而，方向部分恢复了信息内容，因为您可能会发现 (A,1)@@@(B,0)@(C,1) 表示最小距离为 3。更好的是比较多个相关序列的算法（基因组）同时进行，因为您可以在进化历史中找到中间体，从而提高分辨率。

我意识到，在文本字符串比较上已经发布了几个问题。但是它们无法轻松扩展以包含方向。此外，存在大量处理生物序列的方法，特别是多序列分析。然而，这些仅限于不以交替方向存在的大分子序列，并且通常为任何特定字符匹配调用特定权重。

如果已经存在一个允许必要的定制来解决该问题的 python 库，那就太好了。但是任何合适的方向感知算法都会非常有帮助。

score 1 · Accepted Answer

我相信以下代码可以帮助您：

from itertools import permutations
from random import randint
from pprint import pprint


def generate_genes():
    """
    Generates a boilerplate list of genes
    @rtype : list
    """
    tuple_list = []

    for i in range(16):

        binary_var = bin(i)[2:]

        if len(binary_var) != 4:
            binary_var = "0" * (4 - len(binary_var)) + binary_var

        tuple_list.append([('A', (1 if binary_var[0] == '1' else 0)),
                           ('B', (1 if binary_var[1] == '1' else 0)),
                           ('C', (1 if binary_var[2] == '1' else 0)),
                           ('D', (1 if binary_var[3] == '1' else 0))])

    return tuple_list


def all_possible_genes():
    """ Generates all possible combinations of ABCD genes
    @return: returns a list of combinations
    @rtype: tuple
    """
    gene_list = generate_genes()
    all_possible_permutations = []
    for gene in gene_list:
        all_possible_permutations.append([var for var in permutations(gene)])

    return all_possible_permutations


def gene_stringify(gene_tuple):
    """
    @type gene_tuple : tuple
    @param gene_tuple: The gene tuple generated
    """

    return "".join(str(var[0]) for var in gene_tuple if var[1])


def dameraulevenshtein(seq1, seq2):
    """Calculate the Damerau-Levenshtein distance between sequences.

    This distance is the number of additions, deletions, substitutions,
    and transpositions needed to transform the first sequence into the
    second. Although generally used with strings, any sequences of
    comparable objects will work.

    Transpositions are exchanges of *consecutive* characters; all other
    operations are self-explanatory.

    This implementation is O(N*M) time and O(M) space, for N and M the
    lengths of the two sequences.

    >>> dameraulevenshtein('ba', 'abc')
    2
    >>> dameraulevenshtein('fee', 'deed')
    2

    It works with arbitrary sequences too:
    >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
    2
    """
    # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
    # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
    # However, only the current and two previous rows are needed at once,
    # so we only store those.
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        # Python lists wrap around for negative indices, so put the
        # leftmost column at the *end* of the list. This matches with
        # the zero-indexed strings and saves extra calculation.
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # This block deals with transpositions
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[len(seq2) - 1]


if __name__ == '__main__':
    genes = all_possible_genes()
    list1 = genes[randint(0, 15)][randint(0, 23)]
    list2 = genes[randint(0, 15)][randint(0, 23)]

    print gene_stringify(list1)
    pprint(list1)
    print gene_stringify(list2)
    pprint(list2)
    print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2))

学分

算法的迈克尔荷马

python - 任意字符串与定向字符的多序列比较

1 回答 1

学分

Related

Reference