html - Bash diff body text of html file only

Question

I'm writing a shell script which tracks the changes of a website and emails me with the contents of the change if one occurs. The idea is to use wget to grab a copy of the html and compare it to the version from the last time the script ran. Wget works fine to save the html file but I'm having trouble comparing the files. The trouble is that I'm only interested in changes in the html file's plain text, not the code, links, etc.

Diff works to find all the changes in the two files but it ALWAYS returns changes even when the plain text is identical. This is because each link on the site has a corresponding authenticity token that differs each time the page is accessed. In order to diff only the lines that include plain text I'm attempting to filter it to exclude any line that begins with "<" OR "(any_amount_of_spaces)<". I've looked at the diff man page but I can't seem to find an operator that will do what I need. I don't know much about REGEX but would that work with diff -I for this?

Thanks!

score 3 · Accepted Answer

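You can use lynx -dump to render the page and feed the result to diff (for your use case you will probably want to strip the References section first, e.g. with awk).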
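
A minimal sketch of the lynx -dump route. It assumes that the link list lynx appends to its dump starts at a line reading exactly "References", which is where the changing authenticity tokens would end up. The function names strip_references and page_text are hypothetical.

```shell
#!/usr/bin/env bash
# Drop the trailing link list from a lynx dump read on stdin.
# Assumption: the list begins at a line that is exactly "References";
# that line and everything after it are discarded.
strip_references() {
    awk '/^References$/ { exit } { print }'
}

# Hypothetical helper: plain-text view of one saved HTML file.
page_text() {
    lynx -dump "$1" | strip_references
}

# Usage, with the two saved copies of the page:
#   diff <(page_text before.html) <(page_text after.html)
```

If the page body itself can contain an unindented line reading "References", the awk pattern would cut the text short there, so check it against a real dump of your site first.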

If you don't mind using third-party tools, go with html2text:

diff <(html2text before.html) <(html2text after.html)

PS: There are two different programs called html2text.
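
For the tracking script described in the question, the pieces could be wired together roughly as follows. This is only a sketch: text_diff, check_site and the TO_TEXT variable are hypothetical names, it assumes your converter takes a file name and writes text to stdout (as in the command above), and it assumes a mail command that accepts -s for the subject.

```shell
#!/usr/bin/env bash
# Sketch: report text-only changes between two saved copies of a page.
# TO_TEXT is a hypothetical knob naming the HTML-to-text converter.
TO_TEXT=${TO_TEXT:-html2text}

# Print the text diff of two HTML files.
# Exit status follows diff: 0 = same text, 1 = text differs.
text_diff() {
    diff <("$TO_TEXT" "$1") <("$TO_TEXT" "$2")
}

# Hypothetical driver: fetch the page, mail the diff if the text
# changed since the last run, then keep the new copy for next time.
check_site() {
    local url=$1 addr=$2 changes
    wget -q -O after.html "$url" || return 2
    if [ -f before.html ] && ! changes=$(text_diff before.html after.html); then
        printf '%s\n' "$changes" | mail -s "Page changed: $url" "$addr"
    fi
    mv after.html before.html
}
```

Running check_site with the page URL and your address from cron would cover the periodic check. On the first run there is no before.html yet, so nothing is mailed.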
