
I am currently using Nutch 2.2.1 with HBase 0.90.4. I expect around 300K URLs from about 10 seed URLs; I had already crawled that many with Nutch 1.6. Since I want to manipulate the stored data, I preferred to go the Nutch 2.2.1 + HBase route. But I get all sorts of errors, and the crawl doesn't seem to progress.

I see various errors, such as:

  1. zookeeper.ClientCnxn - Session for server null, unexpected error, closing socket connection and attempting reconnect. - This is the one I get most frequently (see the ZooKeeper sketch after this list).

  2. bin/crawl: line 164: killed - This comes from the fetch step, and the crawl is killed all of a sudden (see the memory sketch after the setup paragraph below).

  3. RSS parse error (full log entry in Edit #1 below).

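For error 1, my understanding is that "Session for server null" means the ZooKeeper client has no live server to connect to, which in a standalone setup usually means HBase is not actually running or Nutch cannot see its configuration. Here is a minimal sketch of what I checked, assuming a standalone HBase on localhost (the data paths below are placeholders, not my real ones):

    # verify the HBase master (and its embedded ZooKeeper) is up
    jps | grep HMaster

    # hbase-site.xml - I also copied it into Nutch's conf/ so the
    # Gora HBase store sees the same settings
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///path/to/hbase-data</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/path/to/zookeeper-data</value>
      </property>
    </configuration>

If HMaster keeps dying and restarting, the ZooKeeper reconnect messages would just be a symptom, so the HBase logs are probably the place to look.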
I am using the all-in-one crawl command:

    bin/crawl urls 1 http://localhost:8983/solr/ 10

whose usage is:

    bin/crawl <seed-dir> <crawl-id> <solr-url> <number-of-rounds>

so here the seed directory is urls, the crawl id is 1, and it runs 10 rounds.

Please suggest where I am going wrong. I have Nutch 2.2.1 installed and HBase set up in standalone mode, as per the quick-start guide linked from the Nutch site. I am not sure whether the standalone HBase 0.90.4 setup from that guide is sufficient to reach 300K crawled URLs.
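On error 2, a bare "killed" from the shell at the fetch step looks to me like the JVM being killed by the operating system (most likely the OOM killer) rather than a Nutch-level failure, since standalone HBase and a large fetch share one machine's memory. A sketch of what I tried, assuming the stock bin/nutch script, which reads the NUTCH_HEAPSIZE environment variable (value in MB):

    # give the Nutch JVM more heap before running the crawl
    export NUTCH_HEAPSIZE=4000

    bin/crawl urls 1 http://localhost:8983/solr/ 10

    # afterwards, check whether the kernel OOM killer ended the fetcher
    dmesg | grep -i 'killed process'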


Edit #1: RSS parse error - log information

    Error tika.TikaParser - Error parsing http://www.###.###.##/###/abc.xml
    org.apache.tika.exception.TikaException: RSS parse error
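If the feeds themselves are not needed for my crawl, one workaround I am considering is filtering them out before they are fetched. A sketch assuming the default conf/regex-urlfilter.txt (the extension list is my guess at what this site serves):

    # add above the final accept-everything rule (+.)
    # skip feed documents that Tika fails to parse
    -\.(rss|xml|atom)$

But I would rather understand why the Tika RSS parser fails on this feed in the first place.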
