python - Python SequenceMatcher Overhead - 100% CPU utilization and very slow processing

Question

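I am using difflib to compare files in two directories (versions from consecutive years). First I use filecmp to find the files that have changed, then I iterate over them with difflib.SequenceMatcher to compare each pair and generate an HTML diff, as described here.

However, the program takes far too long to run, and Python sits at 100% CPU. Profiling shows that the time is being spent in the seqm.get_opcodes() call.

Any insights would be appreciated. Thanks!

Code: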

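import difflib
import os
import string

# changed_set contains the names of the files to be compared
for i in changed_set:
    # read each version as one string, closing the file afterwards
    with open(os.path.join(old_dir, i)) as f:
        oldLines = f.read()
    with open(os.path.join(new_dir, i)) as f:
        newLines = f.read()
    # the isjunk callback marks whitespace characters as junk
    seqm = difflib.SequenceMatcher(lambda x: x in string.whitespace, oldLines, newLines)
    opcodes = seqm.get_opcodes()  # XXX: Lots of time spent in this !
    produceDiffs(seqm, opcodes)
    del seqm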

score 3 · Accepted Answer

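My answer is a completely different approach to the problem: try using a version control system such as git to investigate how the directory has changed over the years.

Create a repository from the first directory, then replace its contents with the next year's directory and commit that as a change. (Or move the .git directory into the next year's directory, to save on copying and deleting.) Repeat.

Then run gitk, and you will be able to see what changed between any two revisions of the tree: either just the fact that a binary file changed, or a diff for text files.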
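The steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `commit_snapshots`, the placeholder committer identity, and the commit message format are all made up for illustration, and it assumes the `git` executable is on the PATH and Python 3.8 or newer.

```python
import shutil
import subprocess
from pathlib import Path


def commit_snapshots(repo, snapshot_dirs):
    """Commit each directory in snapshot_dirs, in order, as one revision of repo.

    Hypothetical helper: repo is where the history is built, snapshot_dirs are
    the per-year directories, oldest first.
    """
    repo = Path(repo)
    repo.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "init"], cwd=repo, check=True)

    for snapshot in snapshot_dirs:
        # Clear the working tree but keep the .git directory.
        for entry in repo.iterdir():
            if entry.name == ".git":
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()

        # Copy this year's files in and record them as one commit.
        shutil.copytree(snapshot, repo, dirs_exist_ok=True)
        subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
        subprocess.run(
            [
                "git",
                "-c", "user.name=snapshot",              # placeholder identity
                "-c", "user.email=snapshot@example.com",  # placeholder identity
                "commit", "--allow-empty",
                "-m", "Snapshot " + Path(snapshot).name,
            ],
            cwd=repo,
            check=True,
        )
```

Calling something like `commit_snapshots("history", ["2019", "2020", "2021"])` (directory names are examples) and then running gitk inside `history` should show one revision per year.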

score 1 · Accepted Answer

You could also try the diff-match-patch library; in my experience it can be 10 times faster.

Edit: here is an example from another answer of mine:

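from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)  # False: no line-level pre-pass

    # similarity: share of characters that fall in unchanged (op == 0) segments
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    # two empty texts count as identical; this also avoids dividing by zero
    sim = common_text / text_length if text_length else 1.0

    return sim, diff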
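
Since the question is about producing an HTML diff, here is a minimal sketch of turning the `diff` value returned above into markup. It relies only on the fact that `diff_main` returns a list of `(op, text)` tuples with `op` equal to -1 (delete), 0 (equal) or 1 (insert). The function name `diff_to_html` and the choice of tags are assumptions for illustration, and the sample input is hand-written so the snippet runs without the library installed. The library also ships its own `diff_prettyHtml` method, which may be enough on its own.

```python
import html

# Operation codes as used in diff_match_patch diff tuples.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def diff_to_html(diff):
    """Render a list of (op, text) tuples as a simple HTML fragment."""
    parts = []
    for op, text in diff:
        # Escape markup characters and keep line breaks visible.
        escaped = html.escape(text).replace("\n", "<br>")
        if op == DIFF_INSERT:
            parts.append("<ins>" + escaped + "</ins>")
        elif op == DIFF_DELETE:
            parts.append("<del>" + escaped + "</del>")
        else:
            parts.append("<span>" + escaped + "</span>")
    return "".join(parts)


# Hand-written diff in the same tuple format (not produced by the library).
sample_diff = [(0, "Hello "), (-1, "world"), (1, "there"), (0, "!")]
print(diff_to_html(sample_diff))
# prints <span>Hello </span><del>world</del><ins>there</ins><span>!</span>
```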
