algorithm - 在另一个序列中找到最相似的子序列

Question

我需要编写一个算法，在 N log(N) 中找到 S1 中与另一个字符串 S2 最相似的子字符串（S1 中与 S2 具有最小汉明距离的子字符串，换句话说），其中 N = len(S1) +长度（S2），长度（S2）<=长度（S1）。

例如：
S1 = AGTCAGTC
S2 = GTC
答案：GTC（距离 0）

S1 = AAGGTTCC
S2 = TCAA
答案：TTCC（距离 3）

时间复杂度不得超过 O(N Log(N))。空间复杂度无关紧要。

LCS（最长公共子序列）在我的情况下不起作用。例如：

    S1 = GAATCCAGTCTGTCT
    S2 = AATATAT

    LCS 答案：AATCCAGTC
    GAATCCAGTCTGTCT
     |||
     阿塔塔特

    正确答案：AGCTGTCT
    GAATCCAGTCTGTCT
          | | | | |
          阿塔塔特

score 1 · Accepted Answer

我认为您正在尝试解决最长的公共子序列问题。这个问题涉及试图找到将一个字符串转换为另一个字符串所需的最少修改量。

你说你正在尝试编写一个算法来做到这一点。看看 LCS 问题，如果您想使用自己的算法，请尝试用谷歌搜索它，或者您可以利用命令行实用程序 diff。

就像最长的常见子序列问题一样，diff 需要两个文件并找到一系列添加和删除，从而将 file1 转换为 file2 的修改最少。Diff 非常有效，我想它对于您的目的来说足够快。我很确定大多数差异的空间和时间复杂度为 O(Nlog(N)) 或更小，但您可能想自己验证这一点。更多关于这里的差异http://en.wikipedia.org/wiki/Diff_utility

我写了一个小 Python 程序，它使用 diff 来查找最长的连续公共子序列。这适用于 unix，但我确信您使用的任何平台都有一个 diff 实用程序。

这些文件预计每行有一个字符。您可能必须编写一个程序来对您的文件执行该转换。

import sys
import subprocess
import os
import re

def getLinesFromShellCommand(command):
        p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=os.getcwd())
        lines = []
        for curLine in p.stdout.readlines():
                lines.append(curLine)
        retVal = p.wait()
        return (retVal, lines)

#We need to find the longest sequence of lines that start with a space( a space meant the line was in common).
#We could use some kind of while loop to detect the start and end of a group of lines that start with spaces. 
#However there is a much simpler method. Create a string by concatenating the first char from each line and use regex
#to find all the subsequences that start with spaces. After that just take the longest subsequence.
def findLongestCommonSubsequence(diffOutput):
    outputFirstCharStr = reduce(lambda x, y: x+y[:1], diffOutput, "")
    commonSubsequences = [(m.start(0), m.end(0)) for m in re.finditer(" +", outputFirstCharStr)]
    longestCommonSubsequence = max(commonSubsequences, key=lambda (start,end) : end - start)
    return longestCommonSubsequence

def main():
    if len(sys.argv) != 3:
        sys.stderr.write("usage: python LongestCommonSubsequence.py <file1> <file2>\n")
        sys.exit(1)
    commandOutput = getLinesFromShellCommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2]))
    if commandOutput[0] != 1: # apparently diff returns 1 if its successful
        sys.stderr.write("diff failed with input files.\n")
        sys.exit(1)
    diffOutput = commandOutput[1]
    diffOutput = diffOutput[3:] # the first three lines are not needed
    longestCommonSubsequence = findLongestCommonSubsequence(diffOutput)
    print "Indices:",longestCommonSubsequence
    for i in range(longestCommonSubsequence[0], longestCommonSubsequence[1]):
        print diffOutput[i],

if __name__ == "__main__":
    main()

用法

python LongestCommonSubsequence.py f1.txt f2.txt

如果 f1.txt 和 f2.txt 是您给出的第二个示例，则输出：

Indices: (5, 7)
 T
 C

编辑：我看到了你关于为什么上述方法不起作用的评论。您可能对这篇文章感兴趣：https ://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi

algorithm - 在另一个序列中找到最相似的子序列

1 回答 1

Related

Reference