python - Comparing HTML with difflib

Question

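I only want a reliable diff of the content of this page (its structure rarely changes, so that can be ignored). More specifically, the only change I need to pick up is the addition of a new instruction ID.

To get a feel for what difflib produces, I first compared two identical HTML contents, hoping to get nothing back: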

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.read()
import difflib
d = difflib.Differ()

diffed = d.compare(content, content)

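Since difflib mimics the UNIX diff utility, I expected diffed to contain nothing (or to give some indication that the sequences are identical). But if I '\n'.join diffed, I get something resembling HTML (although it does not render in a browser).

In fact, if I take the simplest case and diff two single characters: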

diffed = d.compare('a', 'a')

diffed.next() yields the following:

'  a'

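So am I expecting difflib to provide something it cannot or will not provide (in which case I should change strategy), or am I misusing it? What are viable alternatives for diffing HTML?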

score 4 · Accepted Answer

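The arguments to Differ.compare() should be sequences of strings. If you pass two plain strings, each is treated as a sequence of characters, so they are compared character by character.

So your example should be rewritten as: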
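A minimal sketch of that difference, using made-up strings rather than the page from the question (Python 3 syntax assumed). Passing a bare string makes Differ iterate over its characters, while passing a list compares element by element. In both cases unchanged items are still emitted, prefixed with two spaces, which is why identical inputs do not give an empty result.

```python
import difflib

d = difflib.Differ()

# Two plain strings: each character is treated as one "line".
char_diff = list(d.compare('ab', 'ab'))

# Two lists of strings: each list element is treated as one line.
line_diff = list(d.compare(['ab\n'], ['ab\n']))

print(char_diff)  # ['  a', '  b']
print(line_diff)  # ['  ab\n']
```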

import difflib
import urllib  # Python 2 API; Python 3 moved urlopen to urllib.request

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib.urlopen(url)
content = response.readlines()  # get response as list of lines
d = difflib.Differ()

diffed = d.compare(content, content)
print(''.join(diffed))  # lines from readlines() already end with '\n'
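Note that Differ.compare() reports unchanged lines as well (prefixed with two spaces), so identical inputs still produce output. One possible way to keep only the changes is to filter on the '+ ' and '- ' prefixes. The following is a sketch only: the sample IDs are invented for illustration and the helper name changed_lines is hypothetical, not part of difflib.

```python
import difflib


def changed_lines(old_lines, new_lines):
    """Return only the added/removed lines from a Differ comparison."""
    d = difflib.Differ()
    return [line for line in d.compare(old_lines, new_lines)
            if line.startswith(('+ ', '- '))]


# Invented sample data standing in for lines of the real page.
old = ['ID 001\n', 'ID 002\n']
new = ['ID 001\n', 'ID 002\n', 'ID 003\n']  # one made-up ID added

same = changed_lines(old, old)
added = changed_lines(old, new)

print(same)   # []
print(added)  # ['+ ID 003\n']
```

With this filter, comparing a page against itself gives an empty list, which is closer to what the UNIX diff utility prints for identical files.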

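If you only want to compare the content of the HTML files, you should probably run them through a parser and keep just the text without the tags, for example with BeautifulSoup's soup.stripped_strings:

import bs4

# html_content is the page's HTML; d and list_to_compare_to are defined elsewhere
soup = bs4.BeautifulSoup(html_content, 'html.parser')
diff = d.compare(list(soup.stripped_strings), list_to_compare_to)
print('\n'.join(diff))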
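For an end-to-end illustration that runs without third-party packages, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser. It is not the same as stripped_strings, only a similar idea: collect the non-empty text nodes and diff those. The class TextExtractor, the function text_of, and both HTML snippets are hypothetical and made up for this example.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect non-empty, whitespace-stripped text nodes."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


def text_of(html):
    """Return the list of text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.strings


# Invented snippets: the markup differs (a class attribute) and one ID is added.
old_html = '<ul><li>ID 001</li><li>ID 002</li></ul>'
new_html = '<ul class="x"><li>ID 001</li><li>ID 002</li><li>ID 003</li></ul>'

d = difflib.Differ()
diff = list(d.compare(text_of(old_html), text_of(new_html)))
print('\n'.join(diff))
```

Because only text is compared, the changed class attribute does not show up in the diff; only the added ID is reported as a '+ ' line.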
