I'm writing a shell script that tracks changes to a website and emails me the contents of any change. The idea is to use wget to grab a copy of the HTML and compare it to the version saved the last time the script ran. Wget saves the HTML file fine, but I'm having trouble comparing the files: I'm only interested in changes to the page's plain text, not the markup, links, etc.
Diff finds all the changes between the two files, but it ALWAYS reports changes, even when the plain text is identical. This is because each link on the site carries an authenticity token that differs each time the page is accessed. To diff only the lines that contain plain text, I'm trying to exclude any line that begins with "<", optionally preceded by any number of spaces. I've looked at the diff man page but can't find an option that does what I need. I don't know much about regex, but would a regular expression with diff -I work for this?
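For reference, here is a minimal sketch of the filtering I have in mind, not a working solution. The helper names text_only and same_text are made up for illustration, and it assumes a line counts as "code" whenever its first non-blank character is "<" (I used [[:space:]], so tabs count as well as spaces).

```shell
#!/bin/sh
# Hypothetical helper: print a file without the lines whose first
# non-blank character is "<" (assumed here to be markup-only lines).
text_only() {
    sed '/^[[:space:]]*</d' "$1"
}

# Hypothetical helper: diff only the remaining text lines of two files.
# Returns diff's exit status: 0 if the text matches, 1 if it differs.
same_text() {
    old_txt=$(mktemp) || return 2
    new_txt=$(mktemp) || { rm -f "$old_txt"; return 2; }
    text_only "$1" > "$old_txt"
    text_only "$2" > "$new_txt"
    status=0
    diff "$old_txt" "$new_txt" || status=$?
    rm -f "$old_txt" "$new_txt"
    return "$status"
}
```

Calling same_text old.html new.html (file names are placeholders) would print the diff of the text-only lines and return non-zero when they differ. The diff -I variant I'm asking about would presumably be diff -I '^[[:space:]]*<' old.html new.html, but I'm not sure it behaves the same way when a changed block mixes tag lines and text lines.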
Thanks!