nutch - Nutch: crawl every URL within a certain depth

Question

My problem is crawling every page and every document, starting from a given seed list.

I have installed nutch and run it with the following command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

I expected the nutch process to crawl something like 100 URLs, but it reported that it found only 11 documents. So I tried running nutch with this command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 4

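It found 23 documents.

I am running the process from the test seed http://nutch.apache.org

Why does nutch behave this way? How can I set up nutch to crawl every URL within a certain depth, starting from my seeds?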

score 6 · Accepted Answer

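topN sets the maximum number of URLs fetched at each depth level. In your first example the depth is 3. Depth 1 is the seed URL. At depth 2 and at depth 3, up to 5 (the topN value) URLs are fetched. 5 * 2 (depth 2 and depth 3) + 1 (the seed URL, i.e. depth 1) = 11. To fetch more URLs, increase topN. If you do not want any limit, omit the -topN parameter.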
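The arithmetic in this answer can be sketched as a small shell function. This is only an illustration: `max_fetched` is a hypothetical helper, not part of Nutch, and it assumes a fresh crawl directory so that nothing carries over from an earlier run.

```shell
#!/bin/sh
# Hypothetical helper (not a Nutch command): upper bound on the number of
# URLs fetched in one crawl, following the answer's reasoning.
# Usage: max_fetched SEEDS DEPTH TOPN
max_fetched() {
  seeds=$1
  depth=$2
  topn=$3
  # Round 1 fetches the seed URLs, capped at topN.
  first=$seeds
  if [ "$first" -gt "$topn" ]; then
    first=$topn
  fi
  # Each of the remaining (depth - 1) rounds fetches at most topN URLs.
  echo $(( first + topn * (depth - 1) ))
}

# One seed, -depth 3, -topN 5, as in the first command of the question.
max_fetched 1 3 5   # prints 11
```

With one seed the bound grows linearly in topN, so reaching roughly 100 URLs at depth 3 would need a topN of about 50, or no -topN at all.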

