nutch - How do I make Nutch index only pages that contain certain text?

Question

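I have two requirements.

First, I want Nutch to index only pages whose HTML contains certain words. For example, I want Nutch to index only pages that contain the word "wounderful" in their HTML. Second, I want Nutch to index only certain URLs of a site. For example, I want Nutch to index URLs like "mywebsite.com/XXXX/ABC/XXXX" or "mywebsite.com/grow.php/ABC/XXXX", where "XXXX" can be any word of any length.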

Here is the content of my seed.txt file:

http://mysite.org/

Here is the content of my regex-urlfilter.txt:

+^http://mysite.org/work/.*?/text/

I have commented out

#+.

but I still get the error below:

crawl started in: crawl
rootUrlDir = bin/urls
threads = 10
depth = 3
solrUrl=http://localhost:8983/solr/
topN = 5
Injector: starting at 2013-07-09 11:05:51
Injector: crawlDb: crawl/crawldb
Injector: urlDir: bin/urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-09 11:06:08, elapsed: 00:00:17
Generator: starting at 2013-07-09 11:06:08
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

score 2 · Accepted Answer

Start here to set up your desired URL pattern. Then look into plugins to parse your content and decide what should be indexed.

score 0 · Accepted Answer

It shows that the Injector rejected the URL in your seed file:

Injector: total number of urls rejected by filters: 1

Either your regex is not working, or some other pattern is rejecting your URL, such as -.*(/[^/]+)/[^/]+\1/[^/]+\1/ or -[?*!@=]
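In the question's configuration, the only active rule is +^http://mysite.org/work/.*?/text/, and the seed http://mysite.org/ does not match it. The comments in Nutch's default regex-urlfilter.txt say that the first matching pattern decides and that a URL matching no pattern is ignored, which would explain the rejection. Below is a minimal Python sketch of that first-match logic. Python's re module stands in for Java regex here, and parse_rules, accepts, and the extra seed rule are illustrative assumptions, not Nutch APIs.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs.

    Assumes every non-comment line starts with '+' or '-'.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The rules from the question: one accept rule, the catch-all commented out.
original = parse_rules(r"""
+^http://mysite.org/work/.*?/text/
#+.
""")

# Hypothetical variant that also lets the seed URL itself through.
with_seed = parse_rules(r"""
+^http://mysite.org/$
+^http://mysite.org/work/.*?/text/
""")

seed = "http://mysite.org/"
print(accepts(original, seed))   # False: the seed matches no rule
print(accepts(with_seed, seed))  # True: the seed rule matches first
```

Accepting the seed is probably not enough on its own: pages that link from the seed to the /work/.../text/ pages would most likely have to pass the crawl filter too, otherwise those target pages may never be discovered.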

score 0 · Accepted Answer

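I know this is quite old, but I just wanted to add my two cents on crawl-time versus index-time filtering in nutch-1.13.

Testing the regex urlfilter

If you want to test your regex-urlfilter.txt expressions, you can use the plugin tester like this: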

$ bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

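This gives no feedback at first, but if you type URLs and press Enter, you will see each one echoed back with a "-" or "+" prefix, telling you whether the URL passed the configured filter.

For example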

http://aaa.com
-http://aaa.com
http://bbb.com
+http://bbb.com

if the configuration is something like

+^http://bbb.com\.*
-.*

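Crawl filter vs. index filter

This is not well documented, and it took me a while to find the clues. If we want different filter granularity (crawl broadly, but index more selectively), we can do the following.

First, if we are using the bin/crawl script, simply add

the -filter option at the end of the index command
the parameter specifying the regex file to use: -Durlfilter.regex.file

like this: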

<  __bin_nutch index $JAVA_PROPERTIES "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
>  __bin_nutch index $JAVA_PROPERTIES -Durlfilter.regex.file=regex-urlfilter-index.txt "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT -filter

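Otherwise, if you are running the commands without the crawl script, just add these two parameters to the bin/nutch index command.

Now put the desired configuration in the "regex-urlfilter-index.txt" file.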
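As an illustration of what such an index-time file might contain for the URL patterns in the question, here is a hedged Python sketch that checks a candidate rule set against a few sample URLs. The rule, the http://mywebsite.com host, and the reading of "XXXX" as one non-empty path segment are assumptions. Python's re module only approximates the Java regex engine that Nutch uses, and load_rules and check are hypothetical helpers, not Nutch functions.

```python
import re

# Hypothetical contents for regex-urlfilter-index.txt.
INDEX_RULES = r"""
# accept mywebsite.com/<segment>/ABC/<segment>; this also covers grow.php/ABC/...
+^http://mywebsite\.com/[^/]+/ABC/[^/]+
# reject everything else
-.
"""

def load_rules(text):
    """Return (sign, compiled_pattern) pairs, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append((line[0], re.compile(line[1:])))
    return rules

def check(rules, url):
    """Return '+' or '-' for url; first matching rule wins, no match means '-'."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign
    return "-"

rules = load_rules(INDEX_RULES)
for url in [
    "http://mywebsite.com/foo/ABC/bar",
    "http://mywebsite.com/grow.php/ABC/page1",
    "http://mywebsite.com/foo/XYZ/bar",
    "http://mywebsite.com/",
]:
    # Prefix the URL with its verdict: '+' for accepted, '-' for rejected.
    print(check(rules, url) + url)
```

With these rules the first two sample URLs come back with "+" and the last two with "-". The broader crawl-time regex-urlfilter.txt can stay permissive so that the linking pages are still fetched.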

Thanks to Arthur's question on grokbase for the insight: http://grokbase.com/t/nutch/user/1579evs40h/filtering-at-index-time-with-a-different-regex-urlfilter-txt-from-crawl
