ruby - 503 error when accessing a specific site with Ruby's open-uri

Question

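I have been using the code below to crawl a website, but I think I may have crawled too much and gotten myself banned from the site entirely. I can still reach the site in my browser, but any code that uses open-uri against this site throws a 503 Site Unavailable error. I think this is site-specific, because open-uri still works fine with Google and Facebook. Is there a workaround?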

require 'rubygems'
require 'hpricot'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.quora.com/What-is-the-best-way-to-get-ove$

topic = doc.at('span a.topic_name span').content
puts topic

score 5 · Accepted Answer

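There are workarounds, but the best approach is to be a good citizen on their terms. You might want to confirm that you are following their Terms of Service:

If you operate a search engine or robot, or you republish a significant fraction of all Quora Content (as we may determine in our reasonable discretion), you must additionally follow these rules:

You must use a descriptive user agent header.
You must follow robots.txt at all times.
You must make it clear how to contact you, either in your user agent string, or on your website if you have one.

You can set your user-agent header easily using OpenURI: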

Additional header fields can be specified by an optional hash argument.

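  # URI.open is the Ruby 2.5+ spelling; on older Rubies, plain open(...) takes the same arguments.
  URI.open("http://www.ruby-lang.org/en/",
    "User-Agent" => "Ruby/#{RUBY_VERSION}",
    "From" => "foo@bar.invalid",
    "Referer" => "http://www.ruby-lang.org/") {|f|
    # ...
  }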

Robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its settings, or they'll ban you again.
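
As a rough illustration of what "parse it and honor its settings" could look like, here is a minimal sketch of a robots.txt check. It is deliberately simplified: it only reads User-agent and Disallow lines, matches by plain path prefix, and ignores Allow overrides, wildcards, and Crawl-delay values. The class name SimpleRobots and the "ExampleBot" agent string are made up for this example; a maintained robots.txt library would be a safer choice for real crawling.

```ruby
# Simplified robots.txt check: User-agent groups and Disallow prefixes only.
# SimpleRobots is a hypothetical helper, not part of any library.
class SimpleRobots
  def initialize(robots_txt, user_agent)
    @disallowed = parse(robots_txt, user_agent.downcase)
  end

  # True when no Disallow prefix that applies to this agent matches the path.
  def allowed?(path)
    @disallowed.none? { |prefix| path.start_with?(prefix) }
  end

  private

  def parse(text, agent)
    groups = {}        # agent token => list of disallowed path prefixes
    current = []       # agent tokens named by the group being read
    in_rules = false   # true once the current group has seen a rule line
    text.each_line do |line|
      line = line.sub(/#.*/, '').strip            # drop comments
      field, value = line.split(':', 2).map { |s| s.strip }
      next if field.nil? || value.nil?
      case field.downcase
      when 'user-agent'
        current = [] if in_rules                  # a new group starts
        in_rules = false
        current << value.downcase
        groups[value.downcase] ||= []
      when 'disallow', 'allow', 'crawl-delay'
        in_rules = true
        # Only Disallow is honored here; an empty Disallow means "allow all".
        if field.downcase == 'disallow' && !value.empty?
          current.each { |a| groups[a] << value }
        end
      end
    end
    # Prefer a group that names this agent, else fall back to the "*" group.
    specific = groups.keys.find { |a| a != '*' && agent.include?(a) }
    specific ? groups[specific] : groups.fetch('*', [])
  end
end

sample = "User-agent: *\nDisallow: /private\n"
robots = SimpleRobots.new(sample, "ExampleBot/1.0")
puts robots.allowed?("/private/page")
puts robots.allowed?("/What-is-something")
```

In practice the robots.txt text would come from the site itself, fetched once with the same descriptive User-Agent header, and allowed? would be called with each URL's path before requesting it.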

Also, you might want to throttle your code by sleeping between loop iterations.
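
One way to sketch that throttling is a small helper that pauses between iterations. The helper name each_politely and the two-second default are assumptions for illustration; the right delay depends on the site and on anything its robots.txt asks for.

```ruby
# Yields each URL in turn, sleeping between requests.
# each_politely is a hypothetical helper name; delay is in seconds.
def each_politely(urls, delay: 2.0)
  urls.each_with_index do |url, index|
    sleep(delay) if index > 0   # no need to wait before the first request
    yield url
  end
end

# Example use (commented out so the snippet does not hit the network;
# the bot name and contact URL are placeholders):
#
#   require 'open-uri'
#   each_politely(urls) do |url|
#     html = URI.open(url, "User-Agent" => "ExampleBot/1.0 (+http://example.com/contact)").read
#     # ... parse html ...
#   end
```

Keeping the delay in one place makes it easy to slow the crawl down further if the site starts returning 503 responses again.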

Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider. It's more work to write one that plays nicely with a site but better that than not be able to spider their site at all.
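
A local page cache can be sketched in a few lines. The version below assumes a simple on-disk layout (one file per URL, named by a SHA-256 digest of the URL) and never expires entries; the class name PageCache and the directory name are hypothetical.

```ruby
require 'digest'
require 'fileutils'

# Stores fetched page bodies on disk so repeated runs do not re-request them.
# PageCache is a hypothetical helper; entries never expire in this sketch.
class PageCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached body for url if present; otherwise runs the block
  # to fetch it, writes the result to disk, and returns it.
  def fetch(url)
    path = File.join(@dir, Digest::SHA256.hexdigest(url))
    return File.read(path) if File.exist?(path)
    body = yield(url)
    File.write(path, body)
    body
  end
end

# Example use (commented out so the snippet does not hit the network):
#
#   require 'open-uri'
#   cache = PageCache.new("page_cache")
#   html = cache.fetch(url) { |u| URI.open(u, "User-Agent" => "ExampleBot/1.0").read }
```

With this in place, rerunning the scraper while debugging selectors reads from disk instead of sending the same requests to the site again.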
