nutch - How do I crawl URLs that contain spaces with Apache Nutch?

Question

I am crawling with Nutch, but it fails on URLs that contain spaces. I have gone through this thread, http://lucene.472066.n3.nabble.com/URL-with-Space-td619127.html, but did not find a satisfactory answer.

It works for URLs in the seed.txt file, but not for URLs found in the parsed content of a page.

I put a URL containing a space in the conf/seed.txt file; the space was replaced with %20 and I was able to crawl that page. I added the following to regex-normalize.xml:

<regex> 
 <pattern> </pattern> 
 <substitution>%20</substitution> 
</regex>

I also added a reference to regex-normalize.xml in nutch-site.xml, but I still face the same problem.

score 1 · Accepted Answer

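I had the same problem, but with more characters than just spaces, so I changed Fetcher.java. New URLs are added to the queue in the "feeding" part. You have to find this line: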

nURL.set(url.toString());

and replace it with:

nURL.set(URIUtil.encodeQuery(url.toString()));
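To illustrate the idea of escaping more than just spaces before a URL is queued, here is a minimal JDK-only sketch. It is not the URIUtil implementation (that class presumably comes from Apache Commons HttpClient 3.x) and it is not Nutch code; the class name UnsafeUrlChars, the method encodeUnsafe and the chosen character set are assumptions made for this example. It percent-encodes spaces, control characters, non-ASCII characters and a handful of characters that may not appear unescaped in a URL, and leaves everything else, including existing %XX escapes, untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or Commons HttpClient.
final class UnsafeUrlChars {
    // Printable ASCII characters escaped by this sketch (assumed set).
    private static final String UNSAFE = " \"<>{}|\\^`";

    // Percent-encodes unsafe characters as UTF-8 bytes; '%' itself is left
    // alone so that already-escaped sequences are not encoded twice.
    public static String encodeUnsafe(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            i += Character.charCount(cp);
            boolean unsafe = cp < 0x21 || cp > 0x7E || UNSAFE.indexOf(cp) >= 0;
            if (unsafe) {
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeUnsafe("http://example.com/my page.html?q=a b"));
    }
}
```

A helper like this could be applied to the string at the point where the answer calls URIUtil.encodeQuery, if adding the extra dependency is not wanted; whether that is appropriate depends on the Nutch version in use.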

score 1 · Accepted Answer

I had the same problem and added this to my regex-normalize.xml:

<regex> 
   <pattern>&#x20;</pattern> 
   <substitution>%20</substitution> 
</regex>
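In XML, &#x20; is the character reference for a space, so the pattern that reaches the normalizer is a single literal space. The effect of such a rule can be sketched in plain Java as a regex substitution. The class name SpaceRule and the method apply are made up for this example and are not Nutch classes; the sketch only shows the substitution itself, not how Nutch loads or applies its rules.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for one <regex> rule: pattern " " -> substitution "%20".
final class SpaceRule {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Replaces every literal space in the URL string with %20.
    public static String apply(String url) {
        return SPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        System.out.println(apply("http://example.com/some dir/my page.html"));
    }
}
```

Running a few failing URLs through a check like this is a quick way to confirm that the pattern and substitution do what is intended before editing regex-normalize.xml.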
