
I am currently using Nutch 2.2.1 with HBase 0.90.4. I expect around 300K URLs from about 10 seed URLs; I had already crawled that many with Nutch 1.6. Since I want to manipulate the stored data, I preferred to go the Nutch 2.2.1 + HBase route. But I get all sorts of errors, and the crawl doesn't seem to progress.

I see various errors, such as:

  1. zookeeper.ClientCnxn - Session for server null, unexpected error, closing socket connection and attempting reconnect. - This is the one I get most frequently (see the ZooKeeper sketch after this list).

  2. bin/crawl: line 164: killed - This comes from the fetch step, and the crawl is killed all of a sudden (see the memory sketch after the setup paragraph below).

  3. RSS parse error (full log entry in Edit #1 below).

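For error 1, my understanding is that "Session for server null" means the ZooKeeper client has no live server to connect to, which in a standalone setup usually means HBase is not actually running or Nutch cannot see its configuration. Here is a minimal sketch of what I checked, assuming a standalone HBase on localhost (the data paths below are placeholders, not my real ones):

    # verify the HBase master (and its embedded ZooKeeper) is up
    jps | grep HMaster

    # hbase-site.xml - I also copied it into Nutch's conf/ so the
    # Gora HBase store sees the same settings
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///path/to/hbase-data</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/path/to/zookeeper-data</value>
      </property>
    </configuration>

If HMaster keeps dying and restarting, the ZooKeeper reconnect messages would just be a symptom, so the HBase logs are probably the place to look.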
I am using the all-in-one crawl command:

    bin/crawl urls 1 http://localhost:8983/solr/ 10

whose usage is:

    bin/crawl <seed-dir> <crawl-id> <solr-url> <number-of-rounds>

so here the seed directory is urls, the crawl id is 1, and it runs 10 rounds.

Please suggest where I am going wrong. I have Nutch 2.2.1 installed and HBase set up in standalone mode, as per the quick-start guide linked from the Nutch site. I am not sure whether the standalone HBase 0.90.4 setup from that guide is sufficient to reach 300K crawled URLs.
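On error 2, a bare "killed" from the shell at the fetch step looks to me like the JVM being killed by the operating system (most likely the OOM killer) rather than a Nutch-level failure, since standalone HBase and a large fetch share one machine's memory. A sketch of what I tried, assuming the stock bin/nutch script, which reads the NUTCH_HEAPSIZE environment variable (value in MB):

    # give the Nutch JVM more heap before running the crawl
    export NUTCH_HEAPSIZE=4000

    bin/crawl urls 1 http://localhost:8983/solr/ 10

    # afterwards, check whether the kernel OOM killer ended the fetcher
    dmesg | grep -i 'killed process'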


Edit #1: RSS parse error - log information

    Error tika.TikaParser - Error parsing http://www.###.###.##/###/abc.xml
    org.apache.tika.exception.TikaException: RSS parse error
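If the feeds themselves are not needed for my crawl, one workaround I am considering is filtering them out before they are fetched. A sketch assuming the default conf/regex-urlfilter.txt (the extension list is my guess at what this site serves):

    # add above the final accept-everything rule (+.)
    # skip feed documents that Tika fails to parse
    -\.(rss|xml|atom)$

But I would rather understand why the Tika RSS parser fails on this feed in the first place.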
