python - Is there an easy way to script a comparison of a web page over time?

Question

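I have a website that I want to monitor for changes, specifically within one DIV in the HTML. I have been using http://www.followthatpage.com/ to monitor the page, but I have run into two problems: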

It checks the whole site, not just one DIV
It only checks the site once an hour

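Ideally, I would like to write a bash or python script that compares two files every 15 minutes and emails me any changes. I was thinking I could download the two files, run the diff command on them, and set that up in cron to send an email when something changes, but I still don't know how to filter down to just the specific DIV.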

Is there an easier way than figuring out how to do this myself (an existing script)? If not, what is the best approach?

score 4 · Accepted Answer

If you have access to a linux terminal, another approach is to add a cron job:

$ crontab -e

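and add the following line (it runs every day at 16:00; use */15 * * * * in place of the first five fields to run every 15 minutes):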

0   16   *   *   *   diff_web_page.sh

where diff_web_page.sh (which must be executable and either on cron's PATH or referenced by its full path) contains:

#!/bin/bash

URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f "$TMP_FILE" ]]; then
    # First run: create the snapshot and exit.
    lynx -dump "$URL" &> "$TMP_FILE"
    # To keep only one div, filter the raw HTML (-source) rather than the rendered text:
    # lynx -source "$URL" | pcregrep -M "(?s)<div>.*?</div>" > "$TMP_FILE"
else
    # The snapshot exists: grab the new version and compare it.
    lynx -dump "$URL" &> "$TMP_FILE.new"  # filter with pcregrep here too if required
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi

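Cron mails whatever the job prints to user@host (your user on the linux machine where the cron job runs), so each run that finds a change emails you the diff of the web page, and runs with no change send nothing.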

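If you want a specific div, you can filter lynx's output with pcregrep -M or awk when dumping the web page. Note that lynx -dump prints rendered text without tags, so a pattern that matches on <div> tags needs the raw HTML from lynx -source.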
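For readers who would rather stay in Python, the same filtering idea (a multi-line pattern match over the raw HTML, as with pcregrep -M) might look like the sketch below. The function name extract_div and the sample page are hypothetical and exist only for illustration. Like any regex approach, it stops at the first closing </div>, so it assumes the target div contains no nested divs.

```python
import re


def extract_div(html, css_class):
    """Return the first <div> whose class attribute equals css_class, or None.

    The match is non-greedy and ends at the first </div>, so a div that
    contains nested divs would be cut short.
    """
    pattern = re.compile(
        r'<div[^>]*\bclass=["\']' + re.escape(css_class) + r'["\'][^>]*>.*?</div>',
        re.DOTALL | re.IGNORECASE,  # DOTALL lets "." span line breaks
    )
    match = pattern.search(html)
    return match.group(0) if match else None


# Hypothetical page used only for illustration.
page = """<html><body>
<div class="header">Site title</div>
<div class="news">
  <p>Item one</p>
</div>
</body></html>"""

fragment = extract_div(page, "news")
print(fragment)
```

For anything beyond a flat, predictable div, a real HTML parser is the safer choice.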

score 3 · Accepted Answer

Since the div you want is site-specific, you will probably need to set up a simple check yourself.

This involves:

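Downloading the HTML - urllib.request.urlopen(URL) (urllib.urlopen(URL) in Python 2) or requests.get(URL).
Extracting the right part (BeautifulSoup, or roll your own).
Performing the comparison (a direct comparison or difflib).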
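The extraction and comparison steps above can be sketched with the standard library alone. This is a minimal "roll your own" illustration, not a replacement for BeautifulSoup: the class ClassTextExtractor, the helpers extract_text and describe_change, and the two sample snapshots are all hypothetical, and the download step is left out so the example runs without network access.

```python
import difflib
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside every element carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, css_class):
    """Step 2: pull out only the part of the page we care about."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.chunks


def describe_change(old_lines, new_lines):
    """Step 3: return a unified diff as a string (empty if nothing changed)."""
    return "\n".join(
        difflib.unified_diff(old_lines, new_lines, "previous", "current", lineterm="")
    )


# Hypothetical snapshots of the same page taken at two different times.
before = '<div class="status"><p>Build 41</p><p>passing</p></div>'
after = '<div class="status"><p>Build 42</p><p>passing</p></div>'

old = extract_text(before, "status")
new = extract_text(after, "status")
if old != new:                        # direct comparison detects the change
    print(describe_change(old, new))  # difflib makes it readable
```

The direct comparison is enough to decide whether to send a notification; difflib is only needed when the message should show what changed.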

Figuring out what data to extract, and how, will take you the longest. I recommend using the developer tools in Chrome/Firefox.

Suppose we want to know when the counter on digitalocean.com updates. The counter's div looks like this:

<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>

Sadly, there is no id, which would have made it easy to pull out with BeautifulSoup4 (e.g. soup.find(id="counter")).

Instead, I would opt to pull out all the inner elements that have the class "count".

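import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.digitalocean.com')
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(resp.text, 'html.parser')
# class_="count" matches only the digit spans, not the "count_delimiter" comma.
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))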

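BeautifulSoup has excellent documentation and lets you parse HTML documents without much effort (depending on how well laid out the site you are scraping is).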
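To turn an extracted value such as the counter into the periodic email check the question asks for, the script needs to remember the last value between runs and send a message when it differs. The following is only a sketch under stated assumptions: the helper names (update_snapshot, build_alert, notify_if_changed), the snapshot file, and the addresses are hypothetical, and sending assumes an SMTP server is reachable on localhost, which may not be true on your machine.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def update_snapshot(value, snapshot_path):
    """Store value and return the previously stored one (None on the first run)."""
    path = Path(snapshot_path)
    previous = path.read_text(encoding="utf-8") if path.exists() else None
    path.write_text(value, encoding="utf-8")
    return previous


def build_alert(url, previous, current, sender, recipient):
    """Build (but do not send) a plain-text change notification."""
    msg = EmailMessage()
    msg["Subject"] = f"Change detected on {url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Previous value: {previous}\nCurrent value: {current}\n")
    return msg


def notify_if_changed(url, current, snapshot_path, sender, recipient, host="localhost"):
    """Email an alert when current differs from the value seen on the last run."""
    previous = update_snapshot(current, snapshot_path)
    if previous is None or previous == current:
        return False  # first run or no change: nothing to send
    # Assumption: an SMTP server accepts mail on `host` without authentication.
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(build_alert(url, previous, current, sender, recipient))
    return True
```

A call such as notify_if_changed('https://www.digitalocean.com', str(count), '/tmp/count.txt', sender, recipient) at the end of the script, combined with a */15 * * * * cron entry, would run the check every 15 minutes.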
