java - Configuring Nutch regex-normalize.xml

Question

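I am using Nutch, the Java-based web search software. To keep duplicate (URL) results out of my search results, I am trying to remove (i.e. normalize away) the "jsessionid" expression from the URLs that get indexed when I run the Nutch crawler against my intranet. However, my changes to $NUTCH_HOME/conf/regex-normalize.xml (made before running the crawl) do not seem to have any effect.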

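How can I make sure that my regex-normalize.xml configuration is actually being used for my crawl? And,
which regular expression will successfully remove/normalize the 'jsessionid' expression from URLs during crawling/indexing?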

Here are the contents of my current regex-normalize.xml:

<?xml version="1.0"?>
<regex-normalize>
<regex>
 <pattern>(.*);jsessionid=(.*)$</pattern>
 <substitution>$1</substitution>
</regex>
<regex>
 <pattern>(.*);jsessionid=(.*)(\&amp;|\&amp;amp;)</pattern>
 <substitution>$1$3</substitution>
</regex>
<regex>
 <pattern>;jsessionid=(.*)</pattern>
 <substitution></substitution>
</regex>
</regex-normalize>

Here is the command I issue to run my (test) crawl:

bin/nutch crawl urls -dir /tmp/test/crawl_test -depth 3 -topN 500

score 3 · Accepted Answer

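Which version of Nutch are you using? I am not familiar with Nutch, but the default download of Nutch 1.0 already includes a rule in regex-normalize.xml that appears to handle this problem: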

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>
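
To see what this rule does to a URL, it can be tried outside Nutch. The following is a minimal sketch, assuming the normalizer applies each rule the way `Matcher.replaceAll` does with the `<substitution>` text; the class name, method name, and sample URLs are hypothetical. The pattern is written as it reads after XML unescaping, so `&amp;` becomes `&`.

```java
import java.util.regex.Pattern;

public class SessionIdRuleDemo {
    // Pattern text from the rule above, after XML unescaping (&amp; becomes &).
    private static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: a rule is applied like Matcher.replaceAll with the substitution "$4",
    // which keeps only the delimiter that ended the session id (or nothing at end of URL).
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Hypothetical intranet URLs, with and without a query string.
        System.out.println(normalize("http://intranet.example.com/app/page;jsessionid=1A2B3C"));
        System.out.println(normalize("http://intranet.example.com/app/page;jsessionid=1A2B3C?x=1"));
    }
}
```

Because the match is lazy and stops at the first `?`, `&`, `#`, or end of string, the query string after the session id survives, which the greedy `(.*)` patterns in the question would not guarantee.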

By the way, regex-urlfilter.txt also seems to contain something relevant:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
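
The effect of this filter line on session-id URLs can be checked in isolation. This is a rough sketch, assuming a `-` rule rejects a URL whenever its pattern is found anywhere in the URL; the class name, method name, and URLs are hypothetical. Under that assumption, a URL that still carries `;jsessionid=...` contains `=` and would be skipped rather than indexed.

```java
import java.util.regex.Pattern;

public class UrlFilterRuleDemo {
    // Character class from the "-[?*!@=]" line; the leading "-" marks an exclusion rule.
    private static final Pattern PROBABLE_QUERY = Pattern.compile("[?*!@=]");

    // Assumption: an exclusion rule rejects the URL if the pattern occurs anywhere in it.
    static boolean isSkipped(String url) {
        return PROBABLE_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs: one with a session id, one already normalized.
        System.out.println(isSkipped("http://intranet.example.com/app/page;jsessionid=1A2B3C"));
        System.out.println(isSkipped("http://intranet.example.com/app/page"));
    }
}
```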

There are also some settings in nutch-default.xml that you may want to check:

urlnormalizer.order
urlnormalizer.regex.file
plugin.includes

If none of this helps, this related question might: How do I force the fetcher to use a custom nutch-config?
