Questions tagged [stormcrawler]

209 questions

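0 votes

1 answer

159 views

web-crawler - How does StormCrawler identify seed URLs?

I am using StormCrawler with MySQL.

I have 100 seed URLs, but my buffer size is only 50.

What happens if the outlinks of some seeds fall into bucket number zero? In that case, will those outlinks also be treated as seeds?

How does Storm Crawler distinguish seed URLs from other URLs?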

2018-09-20T15:17:05.607

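0 votes

1 answer

56 views

web-crawler - A quick way to test a LinkParseFilter

I would like to know whether there is a quick way to unit test a LinkParseFilter configuration.

For example, if I have a parsefilter file with a LinkParseFilter like the following:

What is the quickest way to unit test it against some sample page content, to check that it extracts what I want?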

web-crawler stormcrawler

2018-10-04T11:09:42.443

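0 votes

1 answer

530 views

web-crawler - How to crawl documents (.pdf, .docx, etc.) with Storm Crawler

I am using Storm crawler 1.10. I am also trying to get the crawler to crawl documents. Based on some research I added the Tika parser, but the crawler is not fetching .pdf URLs. When I apply the Tika functionality, newlines (\n) inside HTML pages are being crawled, which looks odd when I check the results in Kibana. The regex filters place no restriction on documents. I am sharing my configuration. Can anyone help me with this case of crawling documents only?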

web-crawler stormcrawler

2018-10-18T13:37:14.517

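0 votes

1 answer

186 views

web-crawler - How to exclude script and style tags from the text extracted by StormCrawler?

I am using Storm crawler 1.10 and Elastic Search 6.3.x. I added http.content.limit=-1 to the configuration. The crawler runs fine, but when I check the results, JavaScript functions and CSS data show up in the index. Is it possible to apply an XPath filter in parserfilter.json (for example, for <script> and <style>), or is there any other suggestion for restricting the crawler so that it avoids these? I am sharing sample data as it appears in some of the records.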
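To make the goal in this question concrete, here is a minimal sketch of what "text without script and style content" means. It uses only Python's standard `html.parser` and is not how StormCrawler extracts text; the class name `VisibleTextExtractor` and the helper `visible_text` are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the page text with script and style content removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `visible_text(page_html)` on a fetched page gives the kind of output the asker expects to see in the index. In a real crawl the equivalent filtering would have to happen inside the crawler's own parsing step, before the text is indexed.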

web-crawler stormcrawler

2018-10-20T18:20:59.860

0 votes

1 answer

192 views

web-crawler - Does Stormcrawler follow secondary JavaScript page content loads?

From looking at my scraped results for webmd.com, it seems it may not and I guess it's way too much to expect that it would since that would be very complicated. But I figured I'd ask anyway to double check.

So, if I have a page that uses JavaScript to load its body after the initial page load, does Stormcrawler have any method by which it will wait for this secondary content to load and then scrape the page?

I imagine no crawler does this except very sophisticated ones like those Google or Bing might use, and maybe even they don't, since it would require browser-level intelligence and complexity. The thought of how you'd even implement behavior on this scale is anxiety-producing.

web-crawler nutch stormcrawler

2018-10-22T20:22:53.217

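0 votes

1 answer

340 views

regex - Applying regex filters to a crawler to crawl specific pages

I am using Storm crawler 1.10 and Elastic Search 6.3.x. For example, I have a main website, https://www.abce.org, which has the subpages https://abce.org/def and https://abce.org/ghi. I want to crawl only https://www.abce.org/ghi.

My seed URL is https://www.abce.org/ghi/.

At the moment I keep applying different regex filters each time.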
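The filters the asker tried are not shown here. As a rough illustration of the stated goal (keep only URLs under /ghi on abce.org), the sketch below applies an ordered list of accept/reject rules where the first match wins, a convention common in Nutch-style regex URL filter files. The rule list, the `accept` function, and the decision to allow both the www and non-www host are assumptions, not the asker's configuration.

```python
import re

# Hypothetical rules: "+" accepts, "-" rejects, first matching rule wins.
RULES = [
    # Accept /ghi itself and anything below it, with or without "www.".
    ("+", re.compile(r"^https?://(www\.)?abce\.org/ghi(/|$)")),
    # Reject everything else.
    ("-", re.compile(r".")),
]


def accept(url: str) -> bool:
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched


if __name__ == "__main__":
    for candidate in (
        "https://www.abce.org/ghi/",
        "https://www.abce.org/ghi/page1",
        "https://abce.org/def",
    ):
        print(candidate, accept(candidate))
```

The `(/|$)` at the end of the accept pattern keeps a path such as /ghixyz from slipping through, which a bare prefix match on /ghi would allow.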