Hi everyone, I'm building a small web crawler that fetches news from some websites. I'm using Typhoeus.

My code looks like this:

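require 'typhoeus'
require 'nokogiri'

# url, source, css_selectors and hydra (a Typhoeus::Hydra) are defined elsewhere in the crawler.
request = Typhoeus::Request.new(url, timeout: 60)
request.on_complete do |response|
  doc = Nokogiri::HTML(response.body)      # parse the fetched page
  root_url = source.website.url
  links = doc.css(css_selectors).take(20)  # keep the first 20 matching nodes
end
hydra.queue(request)
hydra.run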

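The problem is that requests to some sites return an old version of the page. I tried setting request headers, including "Cache-Control" => 'no-cache', but that didn't help. Any help would be appreciated.

The same thing happens when using open-uri.

The response headers from one of the sites: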

{"Server"=>"nginx/1.10.2", "Date"=>"Sat, 07 Jan 2017 12:43:54 GMT", "Content-Type"=>"text/html; charset=utf-8", "Transfer-Encoding"=>"chunked", "Connection"=>"keep-alive", "X-Drupal-Cache"=>"MISS", "X-Content-Type-Options"=>"nosniff", "Etag"=>"\"1483786108-1\"", "Content-Language"=>"ar", "Link"=>"</taxonomy/term/1>; rel=\"shortlink\",</Actualit%C3%A9s>; rel=\"canonical\"", "X-Generator"=>"Drupal 7 (http://drupal.org)", "Cache-Control"=>"public, max-age=0", "Expires"=>"Sun, 19 Nov 1978 05:00:00 GMT", "Vary"=>"Cookie,Accept-Encoding", "Last-Modified"=>"Sat, 07 Jan 2017 10:48:28 GMT", "X-Cacheable"=>"YES", "X-Served-From-Cache"=>"Yes"}

1 Answer

This should work:

headers = {
  "Cache-Control" => 'no-cache, no-store, must-revalidate',
  "Pragma" => 'no-cache',
  "Expires" => '0'
}
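
A minimal sketch of how the headers above might be attached to the request from the question, assuming Typhoeus's `headers:` request option. `NO_CACHE_HEADERS` and `no_cache_options` are hypothetical names introduced here for illustration, and the Typhoeus calls are left as comments so the snippet runs without the gem. These are request headers, so a server-side cache may still choose to ignore them.

```ruby
# Hypothetical constant: the no-cache request headers suggested above.
NO_CACHE_HEADERS = {
  "Cache-Control" => "no-cache, no-store, must-revalidate",
  "Pragma"        => "no-cache",
  "Expires"       => "0"
}.freeze

# Hypothetical helper: builds the options hash to pass to Typhoeus::Request.new.
# The 60-second default mirrors the timeout used in the question.
def no_cache_options(timeout: 60)
  { timeout: timeout, headers: NO_CACHE_HEADERS.dup }
end

# Usage with Typhoeus (requires the typhoeus gem; url and hydra as in the question):
#   request = Typhoeus::Request.new(url, no_cache_options)
#   hydra.queue(request)
#   hydra.run
```

Keeping the headers in one frozen constant means every queued request sends the same set, and the helper returns a copy so callers can add per-request headers without mutating the shared hash.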
answered 2017-01-07T12:41:22.260