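If you operate a search engine or robot, or you republish a significant fraction of all Quora content (as we may determine in our reasonable discretion), you must additionally follow these rules:
- You must use a descriptive user agent header.
- You must follow robots.txt at all times.
- You must make it clear how to contact you, either in your user agent string, or on your website if you have one.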
Additional header fields can be specified by an optional hash argument.
require "open-uri"

URI.open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
Robots.txt is available at http://www.quora.com/robots.txt. You will need to parse it and honor its settings, or they will ban you again.
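As a rough illustration of what "parse it and honor its settings" can mean, here is a minimal sketch in Ruby. The helper names `parse_robots` and `allowed?` are made up for this example, and the parser is deliberately simplified: it only understands `User-agent`, `Allow` and `Disallow` lines with plain path prefixes, and ignores wildcards and `Crawl-delay`. A real crawler would probably be better served by an existing robots.txt library.

```ruby
# Hypothetical helper: returns the [kind, path_prefix] rules that apply to
# `agent`, falling back to the "*" group when no named group matches.
# Simplified on purpose: no wildcard (*, $) or Crawl-delay support.
def parse_robots(text, agent)
  rules = {}          # user-agent token => [[:allow | :disallow, prefix], ...]
  current = []        # user-agent tokens of the group being read
  in_rules = false    # true once the current group has seen a rule line
  text.each_line do |line|
    line = line.sub(/#.*/, "").strip          # drop comments and whitespace
    next if line.empty?
    field, value = line.split(":", 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when "user-agent"
      current = [] if in_rules                # a new group starts here
      in_rules = false
      current << value.downcase
      rules[value.downcase] ||= []
    when "allow", "disallow"
      in_rules = true
      next if value.empty?                    # an empty Disallow forbids nothing
      current.each { |ua| rules[ua] << [field.downcase.to_sym, value] }
    end
  end
  key = rules.keys.reject { |k| k == "*" }
             .find { |k| agent.downcase.include?(k) } || "*"
  rules.fetch(key, [])
end

# Hypothetical helper: the longest matching prefix decides; a path that
# matches no rule is allowed. Equal-length ties are not specially handled.
def allowed?(rules, path)
  kind, _ = rules.select { |_, prefix| path.start_with?(prefix) }
                 .max_by { |_, prefix| prefix.length }
  kind != :disallow
end
```

In use, you would read the robots.txt body once (for example with `URI.open` and the descriptive headers shown earlier), call `parse_robots(body, "YourBot/1.0")`, and check `allowed?(rules, path)` before every request.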
Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider. It's more work to write one that plays nicely with a site, but that is better than not being able to spider their site at all.
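To make the local caching idea concrete, here is one possible sketch. `PageCache` is a hypothetical class written for this answer, not part of any library. It stores each page body in a file named after the SHA-256 digest of its URL and only runs the supplied block on a cache miss. It has no expiry logic, so stale pages stay cached until you delete the files.

```ruby
require "digest"
require "fileutils"

# Hypothetical helper: a very small on-disk page cache.
class PageCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  # Returns the cached body for `url`, or yields to fetch and store it.
  def fetch(url)
    path = File.join(@dir, Digest::SHA256.hexdigest(url))
    return File.read(path) if File.exist?(path)
    body = yield(url)          # e.g. URI.open(url, headers).read
    File.write(path, body)
    body
  end
end
```

Because the actual download happens in the block, the same spot is a natural place to send your descriptive headers and to sleep between requests, so repeated runs of the spider do not hit the site again for pages you already have.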