
I want to write a simple diff application in Python using Google's Diff Match Patch APIs. I'm quite new to Python, so I'd like an example of how to use the Diff Match Patch API to semantically compare two paragraphs of text. I'm not sure how to use the diff_match_patch.py file or what to import from it. Any help will be much appreciated!

Additionally, I've tried using difflib, but I found it ineffective for comparing sentences that differ substantially. I'm using Ubuntu 12.04 x64.


1 Answer


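Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua, and Python 2.x or Python 3.x). You can therefore usually rely on sample snippets written in a language other than your target language to work out which API calls a given diff/match/patch task requires.

For a simple "semantic" comparison, this is all you need: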

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

# Create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the kind of text you work with, in terms of overall length
# and complexity, you may want to extend (or, as here, disable) the
# timeout feature
dmp.Diff_Timeout = 0   # or some other value; the default is 1.0 seconds

# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() is used to make the diffs array more "human" readable
dmp.diff_cleanupSemantic(diffs)

# And if you want the results as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)
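
The `diffs` value is a plain list of `(op, text)` tuples, where `op` is -1 for a deletion, 0 for unchanged text, and 1 for an insertion. As a minimal sketch of how to consume that list without going through HTML, the snippet below renders it as inline markers. `render_inline` is a hypothetical helper, not part of diff-match-patch, and the list is hard-coded to the cleaned-up result for the two texts above so that the snippet runs without the library installed.

```python
# Op codes used in diff-match-patch result tuples.
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def render_inline(diffs):
    """Render a list of (op, text) tuples as one marked-up string.

    Hypothetical helper for illustration; not part of diff-match-patch.
    """
    parts = []
    for op, text in diffs:
        if op == DIFF_DELETE:
            parts.append("[-" + text + "-]")   # text only in textA
        elif op == DIFF_INSERT:
            parts.append("{+" + text + "+}")   # text only in textB
        else:
            parts.append(text)                 # text common to both
    return "".join(parts)


# Cleaned-up diff of "the cat in the red hat" vs "the feline in the blue hat".
diffs = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(render_inline(diffs))
# the [-cat-]{+feline+} in the [-red-]{+blue+} hat
```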


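A word on the "semantic" processing done by diff-match-patch
Note that this processing is useful for presenting differences to a human reader, because it tends to produce a shorter list of differences by avoiding irrelevant resynchronization of the texts (for example, when two distinct words happen to share letters in the middle). The results are far from perfect, however, because the processing is only a set of simple heuristics based on the length of the differences, surface patterns and the like, rather than actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following "before-and-after-diff_cleanupSemantic" values for the diffs array: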

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]

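Nice! The letter 'e' common to "red" and "blue" causes diff_main() to treat that region of the text as four edits, but diff_cleanupSemantic() reduces it to just two, neatly singling out the two differing words 'red' and 'blue'.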
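
Both arrays describe the same overall change; only the grouping of the edits differs. A minimal sketch to check this is to rebuild the two texts from each array: the source text is every tuple except the insertions, and the target text is every tuple except the deletions. `rebuild` is a hypothetical helper written for this illustration (the library's own `diff_text1()` and `diff_text2()` methods serve the same purpose).

```python
def rebuild(diffs):
    """Return the (source, target) texts encoded by a list of (op, text) tuples.

    Hypothetical helper for illustration; not part of diff-match-patch.
    """
    source = "".join(text for op, text in diffs if op != 1)    # skip insertions
    target = "".join(text for op, text in diffs if op != -1)   # skip deletions
    return source, target


# Raw output of diff_main() for the two example texts.
before = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
          (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
# The same diff after diff_cleanupSemantic().
after = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(rebuild(before))
print(rebuild(before) == rebuild(after))
```

The cleanup therefore only changes how readable the diff is, never what it encodes.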

However, if we have, for example,

textA = "stackoverflow is cool"
textB = "so is very cool"

the before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

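This shows that the supposedly semantically improved "after" can end up more "tortured" than the "before". Note, for example, how the leading 's' is kept as a match, and how the added word 'very' gets mixed in with parts of the ' is cool' expression. Ideally, we would probably expect something like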

[(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
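
One way to get a word-aligned result along these lines is to diff tokens rather than characters. The sketch below is only an illustration of that idea: it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in, so it is not diff-match-patch's own algorithm, and `word_diff` is a hypothetical helper. (The diff-match-patch project describes a similar line/word-mode trick built on `diff_linesToChars()`, which is not shown here.)

```python
import difflib
import re


def word_diff(text_a, text_b):
    """Diff two strings token by token, returning (op, text) tuples.

    Hypothetical helper; uses difflib as a stand-in, not diff-match-patch.
    """
    # Split into alternating runs of whitespace and non-whitespace, so that
    # joining the tokens reproduces each input exactly.
    tokens_a = re.findall(r"\s+|\S+", text_a)
    tokens_b = re.findall(r"\s+|\S+", text_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)

    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, "".join(tokens_a[a0:a1])))
            continue
        if tag in ("delete", "replace"):
            diffs.append((-1, "".join(tokens_a[a0:a1])))
        if tag in ("insert", "replace"):
            diffs.append((1, "".join(tokens_b[b0:b1])))
    return diffs


print(word_diff("stackoverflow is cool", "so is very cool"))
# [(-1, 'stackoverflow'), (1, 'so'), (0, ' is '), (1, 'very '), (0, 'cool')]
```

Because whole words are the unit of comparison, the stray match on the leading 's' cannot occur. The trade-off is that small in-word changes, such as typos, are reported as whole-word replacements.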
Answered 2013-04-18T15:09:12.537