nutch - Nutch redirect handling problem

Question

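I'm fairly new to Nutch. I'm crawling a URL that redirects to another URL. When I analyze my crawl results, I get the content of the first URL along with the status: temp redirected to (second url name). My question is: why am I not getting the content and details of the second URL? Is the redirected URL being crawled at all? Please help.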

score 1 · Accepted Answer

Again, in the almighty nutch-default.xml, there is a property that controls how Nutch handles redirects.

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

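As the description says, with this setting the fetcher won't immediately follow redirected URLs; it records them for later fetching instead. I haven't figured out how to force fetching of the db_redir_temp URLs. However, if you change the configuration at the very beginning, I think you should be good to go.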
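The configuration change mentioned above could look roughly like the following override in conf/nutch-site.xml, which takes precedence over nutch-default.xml. This is only a sketch: the value 3 is an arbitrary assumption chosen for illustration, not a recommended setting, and the description text is paraphrased from the default quoted above.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Override of the default shown above. The value 3 is an assumed,
       illustrative limit; pick whatever fits your crawl. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>With a positive value, the fetcher follows up to this many
    redirects while fetching a page instead of only recording the target
    for later fetching.
    </description>
  </property>
</configuration>
```

With a setting like this in place before the crawl starts, the content of the redirect target should be fetched in the same round rather than being deferred.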

score 0 · Accepted Answer

In Nutch 2.3.1, I tried setting the following property in my nutch-site.xml, and it helped me fetch the redirected URL on the next attempt. This may help others who are trying Nutch 2.3.1.

<property>
  <name>db.fetch.interval.default</name>
  <value>0</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

score 0 · Accepted Answer

In Nutch 2.3.1, there is a method named getProtocolOutput in the class

org.apache.nutch.protocol.http.api.HttpBase

Inside this method there is a call to another method:

Response response = getResponse(u, page, false); (Line 250)

Change false to true in the call above, since this flag refers to followRedirects.

Then recompile the Nutch classes, and following redirects will work fine :)
