
I have familiarized myself with crawling using Apache Nutch and Solr, but noticed that while HTTP and HTTPS links appear in the content field of Solr query results, magnet links do not. I adjusted conf/regex-urlfilter.txt to be:

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# for linuxtracker.org
+^https?://*linuxtracker.org/(.+)*$
#+^magnet:\?xt=(.+)*$
    # causes magnet links to be ignored/not appear in content field
+^magnet:*$

# reject anything else
-.

and I don't see why magnet links shouldn't be included in content. As the filter shows, I'm investigating this using http://linuxtracker.org, which has, for example, the magnet link magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P on http://linuxtracker.org/?page=torrent-details&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf.

After crawling with bin/crawl, there are no magnet links when querying Solr as follows with pysolr:

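import pysolr

# solr_core_url is the URL of the Solr core that Nutch indexes into (defined elsewhere)
solr = pysolr.Solr(solr_core_url, timeout=10)
# match every indexed document and print each one
results = solr.search('*:*')
for result in results:
    print(result)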
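
Rather than reading every printed document by eye, the check can be made explicit. The following is a minimal sketch, assuming each result is a dict with `id` and `content` keys as in Nutch's default Solr schema. The helper name `ids_with_magnet` and the sample documents are made up for illustration.

```python
def ids_with_magnet(docs, field="content"):
    """Return the ids of documents whose `field` contains the text 'magnet:'."""
    hits = []
    for doc in docs:
        value = doc.get(field, "")
        # Solr may return a multi-valued field as a list; flatten it to one string.
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        if "magnet:" in str(value):
            hits.append(doc.get("id"))
    return hits


# Stand-in documents shaped like the dicts pysolr yields (made-up values).
sample_docs = [
    {"id": "doc-1", "content": "Download: magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P"},
    {"id": "doc-2", "content": "Only http://linuxtracker.org/ is mentioned here"},
]
print(ids_with_magnet(sample_docs))
```

Against a live core, the same helper could be called as `ids_with_magnet(solr.search('*:*'))`, since pysolr results can be iterated as dicts. An empty list would confirm that no indexed document carries a magnet link in content.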

I'm using Apache Nutch release-1.13-73-g9446b1e1 and Solr 6.6.1 on Ubuntu 17.04.


1 Answer


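Short answer: magnet links are not "normal" links, and Nutch does not support them out of the box.

Long answer:

The configuration you changed is applied after the links are extracted. If you're using the parse-html parsing plugin, it first tries to evaluate whether each possible outlink is a valid link, which basically just means creating a java.net.URL.

java.net.URL, on the other hand, does not support magnet links out of the box. According to the Javadocs:

Protocol handlers for the following protocols are guaranteed to exist on the search path:

 http, https, ftp, file, and jar

Constructing a java.net.URL with an unknown protocol such as magnet throws a MalformedURLException, so the link never reaches your URL filter rules.

If you're using parse-tika, something similar happens.

One option could be to write a custom parser that handles this for you. Keep in mind that in any case you don't want to follow magnet links as outlinks, because Nutch will not be able to handle those links.

If you only want to index the links in Solr/ES (for search purposes), you could write your own HtmlParseFilter and add those links to a separate field.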
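
A real HtmlParseFilter is a Java class inside a Nutch plugin, so the following is only a sketch of the extraction step such a filter would perform, written in Python to match the snippet in the question. It uses the standard library's `html.parser` and nothing from Nutch. The names `MagnetLinkCollector` and `extract_magnet_links` are hypothetical.

```python
from html.parser import HTMLParser


class MagnetLinkCollector(HTMLParser):
    """Collect href values of <a> tags that use the magnet: scheme."""

    def __init__(self):
        super().__init__()
        self.magnet_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Attribute values arrive already unescaped (&amp; becomes &).
            if name == "href" and value and value.lower().startswith("magnet:"):
                self.magnet_links.append(value)


def extract_magnet_links(html):
    """Return all magnet: hrefs found in the given HTML string, in order."""
    collector = MagnetLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.magnet_links


# Made-up page fragment using the magnet link from the question.
page = (
    '<a href="magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P">magnet</a>'
    '<a href="http://linuxtracker.org/">home</a>'
)
print(extract_magnet_links(page))
```

In a Nutch filter, the collected values would be stored in a dedicated field rather than emitted as outlinks, so they become searchable without Nutch ever trying to fetch them.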

Answered 2017-10-06T10:35:34.550