web-crawler - How to limit crawling of duplicate, similar URLs

Question

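I am working with StormCrawler 1.10 and ES 6.4.2. After the crawl finishes, when I check the records, I see that the crawler has fetched both the https and the http URL, with the same title and description. How can I tell the crawler to fetch only one of these URLs?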

Title: About Apache storm
Description:A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: https://www.someurl.com


Title: About Apache storm
Description:A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
url: http://www.someurl.com

score 1 · Accepted Answer

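These variants are usually handled by the site as redirects, in which case you would get only one document. Alternatively, the site can provide a canonical tag, which StormCrawler uses as the URL value if present.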

StormCrawler looks at each document in isolation and has no knowledge of other URLs. You could implement this outside SC by either:

collapsing the results when querying the index
deduplicating the content of the index, e.g. using MapReduce
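
As a rough illustration of the two options above, here is a minimal Python sketch. The first function builds a search request body that uses Elasticsearch's `collapse` parameter so that only one hit is returned per value of a field; the field passed in must be a keyword or numeric field, and both the `content` field in the query and the `title_keyword` name used in the usage note are assumptions about your mapping, not something StormCrawler guarantees. The second function is a plain in-memory stand-in for an offline deduplication job (not MapReduce itself): it keeps one record per title/description pair and, as an arbitrary choice made for this example, prefers the https URL.

```python
from typing import Dict, Iterable, List, Tuple


def collapsed_search_body(query_text: str, collapse_field: str) -> dict:
    """Build a search body that returns one hit per distinct collapse_field value.

    "content" is an assumed name for the indexed text field.
    """
    return {
        "query": {"match": {"content": query_text}},
        "collapse": {"field": collapse_field},
    }


def dedupe_records(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one record per (title, description) pair, preferring https URLs."""
    kept: Dict[Tuple[str, str], Dict[str, str]] = {}
    for record in records:
        key = (record["title"], record["description"])
        current = kept.get(key)
        if current is None:
            # First record seen for this title/description pair.
            kept[key] = record
        elif record["url"].startswith("https://") and not current["url"].startswith("https://"):
            # Replace an http record with its https twin.
            kept[key] = record
    return list(kept.values())
```

For example, `collapsed_search_body("storm", "title_keyword")` gives a body that can be sent to the index's `_search` endpoint, and passing the two records from the question to `dedupe_records` leaves only the `https://www.someurl.com` entry.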

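One option for handling any remaining duplicates within SC would be to generate a custom metadata value, e.g. a hash of the content, and modify the ES Indexer bolt so that it uses that value, if present, as the document ID instead of the normalised URL. You would then get a single document, but you could not choose which of the URLs (http or https) is used.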
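
The ID selection described above can be sketched in a few lines of Python. This is only an illustration of the logic, not StormCrawler code: the `digest` metadata key, the whitespace normalisation, and the use of SHA-256 for both the content hash and the URL fallback are assumptions made for this example. The actual change would go into the Java indexer bolt.

```python
import hashlib
from typing import Dict


def content_digest(text: str) -> str:
    """Hash the page text after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def document_id(normalized_url: str, metadata: Dict[str, str]) -> str:
    """Use the content digest as the document ID if present, else hash the URL."""
    digest = metadata.get("digest")  # hypothetical metadata key
    if digest:
        return digest
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()
```

With this scheme the http and https variants of a page with identical text map to the same ID, so the second one to be indexed overwrites the first. That is why you end up with a single document but have no control over which URL it carries.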
