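I wrote a crawler for one specific dynamic website. A full crawl takes more than 3 hours. I want to check whether a page has already been crawled and whether it has changed since then, so that unchanged pages can be skipped. If I can do that, the script will finish in a much shorter time.
For example: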
foreach ($urls as $url) {
    if (thereAreChanges($url)) {
        crawl($url);
    }
}
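One possible shape for thereAreChanges(), as a minimal sketch: since the response carries no Content-Length or checksum, it hashes the downloaded body and compares it with the hash stored on the previous run. The extra parameters ($body and the $hashes store) are assumptions added here so the check can be tried without network access; they are not part of the loop above. This only saves the processing done in crawl(), not the download itself, and it will report a change on every run if the page embeds a session ID or timestamp in its HTML.

```php
<?php
// Hypothetical helper: the three-argument signature is an assumption
// made for illustration; the original loop only passes $url.
function thereAreChanges(string $url, string $body, array &$hashes): bool
{
    // Fingerprint the body; identical bytes give an identical hash.
    $hash = hash('sha256', $body);

    if (isset($hashes[$url]) && $hashes[$url] === $hash) {
        return false; // same content as on the previous run
    }

    // New URL or changed content: remember the hash for the next run.
    $hashes[$url] = $hash;
    return true;
}
```

In a real run the $hashes array would have to be persisted between crawls (for example in a file or database table), and $body would come from whatever HTTP client the crawler already uses.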
Info: the pages do not provide a Content-Length or a CRC. These are the response headers:
Array (
[0] => HTTP/1.1 200 OK
[Date] => Tue, 08 Jan 2013 07:47:03 GMT
[Server] => Apache
[Set-Cookie] => Array (
[0] => PHPSESSID=eisb6qjme9b0ouoga9su9fgok4; path=/
[1] => j12011=a%3A3%3A%7Bs%3A3%3A%22sid%22%3Bs%3A26%3A%22eisb6qjme9b0ouoga9su9fgok4%22%3Bs%3A2%3A%22ip%22%3Bs%3A12%3A%2294.103.47.65%22%3Bs%3A4%3A%22time%22%3Bi%3A1357631223%3B%7D; expires=Sat, 09-Mar-2013 07:47:03 GMT; path=/
)
[Expires] => Thu, 19 Nov 1981 08:52:00 GMT
[Cache-Control] => no-store, no-cache, must-revalidate, post-check=0, pre-check=0
[Pragma] => no-cache
[Vary] => Accept-Encoding
[Connection] => close
[Content-Type] => text/html
)
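The site sends Content-Type but not Content-Length. How can I ask Apache for the content length?
Update: http://urivalet.com/ is able to get the content length. That is what I need.
If I could get a CRC of the page in the headers, that would be perfect. But I guess that is a long shot.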
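For what it is worth, both values can be computed on the client once the body has been downloaded, even when the server sends neither header. A minimal sketch, assuming the body is already available as a string; bodyFingerprint() is a hypothetical name, not an existing function:

```php
<?php
// Hypothetical helper: derives the length and a CRC-32 from the body
// itself instead of relying on response headers.
function bodyFingerprint(string $body): array
{
    return [
        'length' => strlen($body),         // size in bytes, not characters
        'crc32'  => hash('crc32b', $body), // CRC-32 as 8 hex digits
    ];
}
```

The body could be fetched with file_get_contents($url) or cURL; the result can then be stored per URL and compared on the next run.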