python - Python: best algorithm to avoid downloading unchanged pages when scraping

Question

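I am writing a crawler that periodically checks a list of news sites for new articles. I have read about different ways to avoid unnecessary page downloads, and have identified 5 header elements that could help determine whether a page has changed:

HTTP status
ETag
Last-Modified (used together with the If-Modified-Since request header)
Expires
Content-Length

The excellent FeedParser.org seems to implement some of these approaches.

I am looking for the best code in Python (or any similar language) that makes this kind of decision. Keep in mind that the header information is always provided by the server.

It could look something like this: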

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    #retrieve the headers, do the magic here and return the decision
    return decision

score 2 · Accepted Answer

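The only thing you need to check before making a request is Expires. If-Modified-Since is not something the server sends to you; it is something you send to the server.

What you want to do is issue an HTTP GET with an If-Modified-Since header that indicates when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).

In addition, you should keep the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not yet expired.

Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to the request, store the Expires header from the response, and check the status code of the response.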
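The three steps described above (skip the request while the stored Expires is in the future, send If-Modified-Since, treat 304 as "use the stored copy") could be sketched roughly as follows. This is only a minimal sketch using the Python 3 standard library, not the answer author's code. The names `is_fresh`, `conditional_headers` and `fetch_if_changed` are hypothetical, and sending `If-None-Match` with the stored ETag is an added assumption that the answer itself does not mention.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires, now=None):
    """Return True while the stored Expires value is still in the future."""
    if not expires:
        return False
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False  # invalid dates such as "0" are treated as already expired
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return expires_at > (now or datetime.now(timezone.utc))


def conditional_headers(prev_etag=None, prev_lastmod=None):
    """Build request headers that let the server answer 304 if nothing changed."""
    headers = {}
    if prev_lastmod:
        headers["If-Modified-Since"] = prev_lastmod
    if prev_etag:
        headers["If-None-Match"] = prev_etag  # assumption: not part of the answer
    return headers


def fetch_if_changed(url, prev_etag=None, prev_lastmod=None, prev_expires=None):
    """Return (body, validators); body is None when the stored copy can be reused."""
    previous = {"etag": prev_etag, "last_modified": prev_lastmod, "expires": prev_expires}
    if is_fresh(prev_expires):
        return None, previous  # not expired yet: no request is sent at all
    request = urllib.request.Request(url, headers=conditional_headers(prev_etag, prev_lastmod))
    try:
        with urllib.request.urlopen(request) as response:
            validators = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
                "expires": response.headers.get("Expires"),
            }
            return response.read(), validators
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, previous  # not modified: keep using the stored copy
        raise
```

A caller would store the returned validators next to the page and pass them back in on the next crawl. Note that `urllib.request` reports a 304 by raising `HTTPError`, which is why the status check sits in the `except` branch.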

score 1 · Accepted Answer

You need to pass a dictionary of the headers (or the result of urlopen) to shouldDownload:

def shouldDownload(url, headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # the response carries Last-Modified; If-Modified-Since is a request header
    return (prev_content_length != headers.get("Content-Length")
            or prev_lastmod != headers.get("Last-Modified")
            or prev_expires != headers.get("Expires")
            or prev_etag != headers.get("ETag"))
    # or the optimistic way (test for "unchanged", then negate):
    # return not (prev_content_length == headers.get("Content-Length") and prev_lastmod == headers.get("Last-Modified") and prev_expires == headers.get("Expires") and prev_etag == headers.get("ETag"))

Do this when opening the URL:

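import urllib2  # Python 2; on Python 3 the equivalent module is urllib.request

# my urllib2 is a little fuzzy but I believe `urlopen()` doesn't
#  read the whole file until `.read()` is called, and you can still
#  get the headers with `.headers`.  Worst case is you may have to
#  `read(50)` or so to get them.
# prev_etag etc. are the values stored from the previous fetch of MYURL
s = urllib2.urlopen(MYURL)
try:
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source
    # otherwise nothing changed: skip the body and keep the stored copy
# except HTTPError, etc if you need to
finally:
    s.close()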
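If the aim is to look at the headers without transferring the body at all, one alternative to opening the URL with a GET is a HEAD request. The following is a rough Python 3 sketch of that swapped-in technique, not the answer's own method. `fetch_headers`, `headers_changed` and `VALIDATORS` are hypothetical names, and the rule "download when nothing can be compared" is an assumption chosen to err on the side of fetching. Not every server answers HEAD requests accurately, so this is a heuristic at best.

```python
import urllib.request

# Response headers compared between two fetches (standard HTTP header names).
VALIDATORS = ("ETag", "Last-Modified", "Content-Length")


def fetch_headers(url):
    """Issue a HEAD request and return the response headers without a body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers


def headers_changed(previous, current):
    """Return True if any validator differs, or if none could be compared."""
    compared = False
    for name in VALIDATORS:
        old, new = previous.get(name), current.get(name)
        if old is None or new is None:
            continue  # header missing on one side: no evidence either way
        compared = True
        if old != new:
            return True
    return not compared  # nothing comparable: download to be safe
```

Here `previous` would be the headers saved from the last crawl and `current` the result of `fetch_headers(url)`. If the saved headers are kept in a plain dict, the keys need to use the same capitalization as `VALIDATORS`, because plain dict lookups are case-sensitive.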
