solr - How do I prevent Apache Nutch from crawling external links?

Question

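I want to crawl only a specific domain with Nutch. To do that, I set db.ignore.external.links to true, as described in this FAQ link.

The problem is that Nutch now crawls only the links in the seed list. For example, if I put "nutch.apache.org" in seed.txt, it finds only that same URL (nutch.apache.org).

I got this result by running the crawl script with a depth of 200. It completes one cycle and produces the output below.

How can I solve this?

I am using Apache Nutch 1.11.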

Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

Regards

score 2 · Accepted Answer

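You want to fetch pages only from a specific domain.

You have already tried db.ignore.external.links, but that setting ignores every outlink that leads to a host other than the ones in seed.txt.

You should try conf/regex-urlfilter.txt instead, as in the example from the Nutch 1 tutorial: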

+^http://([a-z0-9]*\.)*your.specific.domain.org/
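To see which URLs a rule like this would let through, the pattern can be tried against a few sample URLs. The snippet below is only a rough sketch: it uses Python's `re` module as a stand-in for the Java regex engine that Nutch uses, the sample URLs are made up for illustration, and the dots in the domain name are escaped, which the rule above does not do.

```python
import re

# Pattern from the rule above, without the leading "+" (the "+" marks an
# accept rule in regex-urlfilter.txt and is not part of the regex itself).
# Assumption: the dots in the domain are escaped here so they match literal
# dots only; the original rule leaves them unescaped.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*your\.specific\.domain\.org/")


def is_accepted(url: str) -> bool:
    """Return True if the URL matches the accept pattern."""
    return ACCEPT.search(url) is not None


# Hypothetical sample URLs, chosen only to exercise the pattern.
samples = [
    "http://your.specific.domain.org/",
    "http://www.your.specific.domain.org/docs/index.html",
    "http://other.example.com/",
    "https://your.specific.domain.org/",
]

results = {url: is_accepted(url) for url in samples}
for url, ok in results.items():
    print(("keep " if ok else "drop ") + url)
```

Note that the `https://` URL is not matched, because the pattern starts with `http://`. If the site is served over HTTPS, starting the rule with `^https?://` may be what you want.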

score 1 · Accepted Answer

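Are you using the "crawl" script? If so, make sure the level (the number of rounds) you pass is greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.

To crawl a specific domain, you can use the regex-urlfilter.txt file.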

score 0 · Accepted Answer

Add the following property to nutch-site.xml:

<property> 
<name>db.ignore.external.links</name> 
<value>true</value> 
<description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description> 
</property>
