Questions tagged [stormcrawler]

209 questions

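0 votes

1 answer

36 views

apache-storm - Can I increase the number of workers during a crawl

Working with StormCrawler 1.10 and Apache Storm 1.2.2. How can I change the number of workers and fetcher threads while a crawl is in progress?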

apache-storm stormcrawler

2018-11-21T15:00:17.717

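0 votes

1 answer

272 views

web-crawler - Speeding up the crawl

Using ES 6.5.x and StormCrawler 1.10. How can I speed up the rate at which the crawler fetches records? When I check the metrics, they show an average of 0.4 pages per second. Is there anything I need to change in the crawler configuration below?

crawler-conf: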

web-crawler stormcrawler

2018-11-21T16:36:01.810

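0 votes

1 answer

254 views

logging - StormCrawler debug logs in local mode

When I look at the StormCrawler source code, there are many useful debug log statements. However, putting a log4j.xml in place and adding loggers does not print them to the console. What steps should I follow to enable logging in StormCrawler?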

logging apache-storm stormcrawler

2018-11-23T01:34:50.300

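0 votes

1 answer

122 views

stormcrawler - StormCrawler XPathFilter - internal representation

When StormCrawler fetches a website, it applies the configured XPathFilter to an HTML representation that is not the original HTML. For example, tags are inserted, or a DIV becomes an H3. The following configuration, for instance, puts HTML code into Elasticsearch that is not the original:

This makes it hard to write XPath expressions based on the website's original source code. Is there any way to configure StormCrawler to apply XPathFilter expressions to the original website source code?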

stormcrawler

2018-11-29T11:38:08.547

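0 votes

1 answer

40 views

stormcrawler - Limiting horizontal depth in StormCrawler (number of outlinks discovered per page)

I am using StormCrawler and would like to know whether there is a way to limit the number of outlinks discovered per page. I am looking for something like Nutch's db.max.outlinks.per.page. Thanks in advance.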

stormcrawler

2018-11-29T21:55:36.880
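
To make concrete what the outlink-limit question above is asking for, here is a minimal sketch of a per-page cap in the spirit of Nutch's db.max.outlinks.per.page. It is not StormCrawler or Nutch code: the function name `limit_outlinks`, its parameters, and the convention that a negative limit means "no limit" are assumptions made for illustration only.

```python
def limit_outlinks(outlinks, max_per_page):
    """Keep at most max_per_page distinct outlinks, preserving discovery order.

    Hypothetical helper for illustration; not part of StormCrawler or Nutch.
    """
    # dict.fromkeys drops repeated URLs while keeping first-seen order.
    unique = list(dict.fromkeys(outlinks))
    if max_per_page < 0:
        # Assumption: a negative value disables the limit.
        return unique
    return unique[:max_per_page]


if __name__ == "__main__":
    page_links = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
        "https://example.com/c",
    ]
    print(limit_outlinks(page_links, 2))
```

In a real crawler the cap would have to be applied wherever outlinks are emitted after parsing; where that hook lives in StormCrawler is not stated in the question.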

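0 votes

1 answer

113 views

web-crawler - How to prevent crawling duplicate, similar URLs

Working with StormCrawler 1.10 and ES 6.4.2. After the crawl completed, I checked the records and found that the crawler had fetched both the https and the http URL, with the same title and description. How can I tell the crawler to fetch only one of the two URLs?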

web-crawler stormcrawler

2018-12-03T16:59:31.863
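
The http/https duplication described in the question above comes down to two URLs that differ only in scheme being treated as different pages. The sketch below shows one way to collapse them to a single key before deciding whether to fetch. It is a standalone illustration, not StormCrawler's URL filter or normalizer API: `canonical_key` and `dedupe` are hypothetical names, and always preferring https is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Return a key that is identical for the http and https variants of a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not a default web port.
    port = parts.port
    if port and port not in (80, 443):
        host = f"{host}:{port}"
    path = parts.path or "/"
    # Assumption: force https so both schemes collapse to one key; drop fragments.
    return urlunsplit(("https", host, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical key, in input order."""
    seen = set()
    kept = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


if __name__ == "__main__":
    print(dedupe([
        "http://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ]))
```

This only helps when both variants really serve the same content; a site that serves different pages over http and https would need a different rule.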

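0 votes

1 answer

963 views

java - Turning off SSL certificate verification

Using StormCrawler 1.12.1 and Elasticsearch 6.5.x. My crawler is running at http://localhost:8080 and Elasticsearch is running at https://localhost:9200. I am trying to crawl a website. During URL injection I get a javax.net.ssl.SSLHandshakeException: General SSLEngine problem error; the detailed error can be seen here.

I tried OkHttp and added https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol" to crawler-conf.yaml.

How can I temporarily turn off certificate verification?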

java web-crawler stormcrawler

2019-01-02T18:12:19.827

0 votes

1 answer

39 views

elasticsearch - Will the crawler re-index records after they are deleted

Working with StormCrawler 1.12.1 and Elasticsearch 6.5.2. I need to improve the efficiency of my search engine. After indexing documents into Elasticsearch, I deleted some of them for security reasons. Will StormCrawler fetch the deleted URLs again and re-index them? I don't want the deleted records to be re-crawled. How can I achieve this?

elasticsearch web-crawler stormcrawler

2019-01-07T15:39:09.987

0 votes

1 answer

89 views