I have familiarized myself with crawling using Apache Nutch and Solr, but noticed that while HTTP and HTTPS links show up in the content field of Solr query results, magnet links do not. I adjusted conf/regex-urlfilter.txt to be
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# for linuxtracker.org
+^https?://*linuxtracker.org/(.+)*$
#+^magnet:\?xt=(.+)*$
# causes magnet links to be ignored/not appear in content field
+^magnet:*$
# reject anything else
-.
and don't see why magnet links shouldn't be included in the content field. As you can see, I'm investigating this using http://linuxtracker.org, which for example has the magnet link magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P on http://linuxtracker.org/?page=torrent-details&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf.
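A quick way to see what the filter rules above would decide for these two URLs is to replay them outside of Nutch. The following is only a rough sketch: it uses Python's `re` as a stand-in for the Java regex engine, and it assumes that the rules are tried in file order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The helper name `decide` is made up for this illustration. It only tells whether a URL passes the filter, not whether the link text ends up in the content field.

```python
import re

# Rules copied from the regex-urlfilter.txt above, in file order.
# Each entry is (sign, pattern); commented-out rules are left out.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
          r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
          r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^https?://*linuxtracker.org/(.+)*$"),
    ("+", r"^magnet:*$"),
    ("-", r"."),
]


def decide(url, rules=RULES):
    """Return the sign of the first rule whose pattern is found in url.

    Assumption: first match wins, and no match at all means rejection.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign
    return "-"


PAGE = ("http://linuxtracker.org/?page=torrent-details"
        "&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf")
MAGNET = "magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P"

print(decide(PAGE))    # +
print(decide(MAGNET))  # -

# The commented-out rule from the file, tested on its own:
print(bool(re.search(r"^magnet:\?xt=(.+)*$", MAGNET)))  # True
```

Under these assumptions, `^magnet:*$` only matches the word magnet followed by zero or more colons and nothing else, so the example magnet URI falls through to the final `-.` rule, while the commented-out pattern does match it.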
After crawling with bin/crawl, there are no magnet links when querying Solr with pysolr as follows:
import pysolr

# solr_core_url is the URL of the Solr core that the crawl indexes into
solr = pysolr.Solr(solr_core_url, timeout=10)
results = solr.search('*:*')
for result in results:
    print(result)
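Instead of reading through the printed documents by eye, the check can be narrowed down to the field in question. This is a minimal sketch that assumes each result is a dict-like document, that the field is named `content`, and that it holds either a string or a list of strings. The helper name `magnet_links` and the two sample documents are invented for illustration and are not crawl output.

```python
import re

# Anything that looks like a magnet URI: "magnet:?" followed by non-whitespace.
MAGNET_RE = re.compile(r"magnet:\?\S+")


def magnet_links(doc, field="content"):
    """Return all magnet URIs found in the given field of one document."""
    value = doc.get(field, "")
    # A multi-valued field comes back as a list, a single-valued one as a string.
    texts = value if isinstance(value, list) else [value]
    links = []
    for text in texts:
        links.extend(MAGNET_RE.findall(str(text)))
    return links


# Made-up documents standing in for Solr results.
sample_docs = [
    {"id": "a",
     "content": "Download magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P now"},
    {"id": "b", "content": ["no links here"]},
]

for doc in sample_docs:
    print(doc["id"], magnet_links(doc))
```

With a live core, the loop would run over `solr.search('*:*')` instead of `sample_docs`. An empty list for every document would confirm that the magnet URIs never reach the content field.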
I'm using Apache Nutch release-1.13-73-g9446b1e1 and Solr 6.6.1 on Ubuntu 17.04.