
I recently started looking into Apache Nutch. I was able to set it up and crawl the web pages I am interested in, but I don't quite understand how to read the resulting data. I basically want to associate each page's data with some metadata (some random data for now) and store it all locally, to be used later for (semantic) search. Do I need Solr or Lucene for that as well? I am new to all of this. As far as I know, Nutch is used to crawl web pages. Can it do extra things, such as adding metadata to the crawled data?


1 Answer


Useful commands.

Start a crawl

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Get statistics on the crawled URLs

bin/nutch readdb crawl/crawldb -stats

Read a segment (dumps all the data from the fetched pages)

bin/nutch readseg -dump crawl/segments/* segmentAllContent

Read a segment (dumps only the parsed text)

bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata

Get the list of known links to each URL, including both the source URL and the anchor text of each link

bin/nutch readlinkdb crawl/linkdb/ -dump linkContent

List all URLs crawled, along with other information such as whether each was fetched, the fetch time, the modified time, etc.

bin/nutch readdb crawl/crawldb/ -dump crawlContent

For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.
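Until a plugin is in place, one simple way to experiment is to post-process the text file produced by `readseg -dump` and attach your own metadata per URL. The sketch below is only illustrative: it assumes records in the dump start with a `Recno::` line followed by a `URL::` line, which matches the dumps I have seen, but the exact layout can differ between Nutch versions, so adjust the markers to your output.

```python
import json
import re

def parse_segment_dump(text):
    """Split a `readseg -dump` output into per-page records.

    Assumption: each record begins with a 'Recno:: N' line and
    contains a 'URL:: <url>' line; tweak the patterns if your
    Nutch version formats the dump differently.
    """
    records = []
    # Split on record boundaries; drop the chunk before the first record.
    for chunk in re.split(r"^Recno::.*$", text, flags=re.M)[1:]:
        m = re.search(r"^URL::\s*(\S+)", chunk, flags=re.M)
        if m:
            records.append({"url": m.group(1), "content": chunk.strip()})
    return records

def attach_metadata(records, metadata_by_url):
    """Associate arbitrary (here: user-supplied) metadata with each page."""
    for rec in records:
        rec["metadata"] = metadata_by_url.get(rec["url"], {})
    return records

if __name__ == "__main__":
    # Tiny inline sample standing in for segmentTextContent/dump.
    sample = (
        "Recno:: 0\nURL:: http://example.com/\n\nContent::\nHello world\n"
        "Recno:: 1\nURL:: http://example.org/\n\nContent::\nAnother page\n"
    )
    recs = attach_metadata(parse_segment_dump(sample),
                           {"http://example.com/": {"tag": "demo"}})
    print(json.dumps(recs, indent=2))
```

The JSON produced this way could later be fed into Solr or any local store for search; for anything beyond experiments, doing this inside a Nutch indexing plugin is the cleaner route.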

Refer:

this and this

answered 2012-05-29T06:47:34.323