http - Can an HTTP 429 status code response also come with HTML?

Translated from: https://stackoverflow.com/questions/51394103 2018-07-18T05:22:09.317

Viewed 151 times

I am using Apache Nutch 1.14 with the selenium protocol. The setting for this in nutch-site.xml is:

<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <!--<value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>-->
  <description>Regular expression naming plugin directory names to ...  
  </description>
</property>
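
To see which plugins this value switches on, the sketch below treats it as a regular expression that must match a whole plugin directory name. Full-string matching is an assumption about how Nutch applies the pattern, `is_included` is a hypothetical helper, and the candidate names are only examples taken from the two values above.

```python
import re

# The active plugin.includes value from the nutch-site.xml snippet above.
PLUGIN_INCLUDES = (
    "protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)"
    "|index-(basic|anchor)|indexer-solr|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the pattern.

    Assumption: the pattern has to match the full name, not just a prefix.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for name in ["protocol-selenium", "protocol-http", "parse-html", "indexer-solr"]:
        print(name, is_included(name))
```

Under that assumption, protocol-selenium is the only protocol plugin enabled; protocol-http and protocol-httpclient appear only in the commented-out value.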

I am trying to crawl a website. I am using a Selenium hub and node.

I get HTTP status code 429.
But I can also see the HTML page in the browser.
But Nutch does not store any content for raw_html.

I get this error:

failed: http code=429, url=https://www.expedia.com/

There are no errors in the hadoop log file either.
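
On the general question of whether a 429 response can carry HTML: the status code and the body are independent, so a server may send both. The following is a minimal sketch using only the Python standard library. The local server, `RateLimitedHandler`, `fetch`, and `demo` are all hypothetical stand-ins for illustration; it says nothing about what expedia.com sends or about how protocol-selenium treats non-200 responses.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical HTML page returned together with the 429 status.
HTML_BODY = b"<html><body><h1>Too Many Requests</h1></body></html>"


class RateLimitedHandler(BaseHTTPRequestHandler):
    """Answers every GET with status 429 and an HTML body."""

    def do_GET(self):
        self.send_response(429)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(HTML_BODY)))
        self.end_headers()
        self.wfile.write(HTML_BODY)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Return (status, body), reading the body even for error statuses."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx, but the response body is still readable.
        return err.code, err.read()


def demo():
    """Start the local server on a free port, fetch once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), RateLimitedHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return fetch(f"http://127.0.0.1:{port}/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = demo()
    print(status, body.decode("utf-8"))
```

A client that discards the body whenever the status is not a success code would report only the 429, even though a browser renders the HTML that came with it.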
