web-crawler - Prioritizing recursive crawl in Storm Crawler

Question

When crawling the World Wide Web, I want to give my crawler an initial seed list of URLs and expect it to automatically 'discover' new seed URLs from the internet during its crawl.

I see such an option in Apache Nutch (see the topN parameter of the Nutch generate command). Is there a similar option in Storm Crawler?

score 1 · Accepted Answer

StormCrawler can handle recursive crawls, and the way URLs are prioritized depends on the backend used to store them.

For instance, you can use the Elasticsearch module; see its README for a short tutorial and a sample config file. By default, the spouts sort URLs by their nextFetchDate (*.sort.field).
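To illustrate the idea of recursive discovery combined with ordering by next fetch date, here is a minimal Python sketch. It is not StormCrawler code: the names Frontier, next_due, crawl and fetch_outlinks are hypothetical, and a real setup would keep this state in a backend such as Elasticsearch rather than in memory.

```python
import heapq
from datetime import datetime, timedelta


class Frontier:
    """Toy URL store ordered by next fetch date (hypothetical, not StormCrawler's API)."""

    def __init__(self):
        self._heap = []     # entries are (next_fetch_date, url)
        self._seen = set()  # URLs already known to the store

    def add(self, url, next_fetch_date):
        # A newly discovered URL is stored once; known URLs are ignored.
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (next_fetch_date, url))
        return True

    def next_due(self, now, max_urls):
        # Return up to max_urls URLs that are due, earliest fetch date first.
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < max_urls:
            due.append(heapq.heappop(self._heap)[1])
        return due


def crawl(seeds, fetch_outlinks, start, max_rounds=10):
    """Fetch the seeds, then keep fetching whatever they link to."""
    frontier = Frontier()
    for seed in seeds:
        frontier.add(seed, start)
    fetched = []
    now = start
    for _ in range(max_rounds):
        batch = frontier.next_due(now, max_urls=2)
        if not batch:
            break
        for url in batch:
            fetched.append(url)
            # Outlinks become new candidates, which makes the crawl recursive.
            for link in fetch_outlinks(url):
                frontier.add(link, now)
        now += timedelta(seconds=1)  # simulated clock tick
    return fetched
```

Only the seeds are supplied up front; every other URL enters the store as an outlink of a fetched page.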

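In Nutch, the -topN parameter only specifies the maximum number of URLs to put in the next segment (based on the scores provided by whichever scoring plugin is used). With StormCrawler we don't really need an equivalent, because URLs are not processed in batches: the crawl runs continuously.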
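For contrast, the batch-style selection that -topN controls can be sketched as picking the N highest-scoring URLs for one segment. This is only a conceptual Python illustration, not Nutch's implementation; generate_segment and the score dictionary are hypothetical.

```python
import heapq


def generate_segment(scored_urls, top_n):
    """Pick at most top_n URLs with the highest scores for one batch (hypothetical helper)."""
    best = heapq.nlargest(top_n, scored_urls.items(), key=lambda kv: kv[1])
    return [url for url, _score in best]
```

In a batch crawler this selection is repeated for each generate/fetch/update cycle, whereas a continuously running crawl simply keeps pulling whichever URLs are due.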
