我认为您正在尝试解决最长的公共子序列问题。这个问题涉及试图找到将一个字符串转换为另一个字符串所需的最少修改量。
你说你正在尝试编写一个算法来做到这一点。看看 LCS 问题,如果您想使用自己的算法,请尝试用谷歌搜索它,或者您可以利用命令行实用程序 diff。
就像最长的常见子序列问题一样,diff 需要两个文件并找到一系列添加和删除,从而将 file1 转换为 file2 的修改最少。Diff 非常有效,我想它对于您的目的来说足够快。我很确定大多数差异的空间和时间复杂度为 O(Nlog(N)) 或更小,但您可能想自己验证这一点。更多关于这里的差异http://en.wikipedia.org/wiki/Diff_utility
我写了一个小 Python 程序,它使用 diff 来查找最长的连续公共子序列。这适用于 unix,但我确信您使用的任何平台都有一个 diff 实用程序。
这些文件预计每行有一个字符。您可能必须编写一个程序来对您的文件执行该转换。
import sys
import subprocess
import os
import re
def getLinesFromShellCommand(command):
p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, cwd=os.getcwd())
lines = []
for curLine in p.stdout.readlines():
lines.append(curLine)
retVal = p.wait()
return (retVal, lines)
#We need to find the longest sequence of lines that start with a space( a space meant the line was in common).
#We could use some kind of while loop to detect the start and end of a group of lines that start with spaces.
#However there is a much simpler method. Create a string by concatenating the first char from each line and use regex
#to find all the subsequences that start with spaces. After that just take the longest subsequence.
def findLongestCommonSubsequence(diffOutput):
outputFirstCharStr = reduce(lambda x, y: x+y[:1], diffOutput, "")
commonSubsequences = [(m.start(0), m.end(0)) for m in re.finditer(" +", outputFirstCharStr)]
longestCommonSubsequence = max(commonSubsequences, key=lambda (start,end) : end - start)
return longestCommonSubsequence
def main():
if len(sys.argv) != 3:
sys.stderr.write("usage: python LongestCommonSubsequence.py <file1> <file2>\n")
sys.exit(1)
commandOutput = getLinesFromShellCommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2]))
if commandOutput[0] != 1: # apparently diff returns 1 if its successful
sys.stderr.write("diff failed with input files.\n")
sys.exit(1)
diffOutput = commandOutput[1]
diffOutput = diffOutput[3:] # the first three lines are not needed
longestCommonSubsequence = findLongestCommonSubsequence(diffOutput)
print "Indices:",longestCommonSubsequence
for i in range(longestCommonSubsequence[0], longestCommonSubsequence[1]):
print diffOutput[i],
if __name__ == "__main__":
main()
用法
python LongestCommonSubsequence.py f1.txt f2.txt
如果 f1.txt 和 f2.txt 是您给出的第二个示例,则输出:
Indices: (5, 7)
T
C
编辑:我看到了你关于为什么上述方法不起作用的评论。您可能对这篇文章感兴趣:https ://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi