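I chose Cassandra as the backend and started working with Nutch.
With a small subset of the DMOZ URLs (~50k), everything (inject, generate, fetch) runs fine.
However, after I injected the whole DMOZ URL set (~3.5M) and tried to generate a fetchlist, I got the following error, which is reproducible on another system: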
~/software/nutch_dmoz/local$ ./bin/nutch generate -topN 1000
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topN: 1000
GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:213)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:241)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:249)
logs/hadoop.log:
2013-04-25 17:58:07,986 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: starting
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: filtering: true
2013-04-25 17:58:08,007 INFO crawl.GeneratorJob - GeneratorJob: topN: 1000
2013-04-25 17:58:08,570 INFO connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2013-04-25 17:58:08,660 INFO service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
2013-04-25 17:58:09,029 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 INFO plugin.PluginRepository - Plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Registered Plugins:
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2013-04-25 17:58:09,560 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Registered Extension-Points:
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Parse Filter (org.apache.nutch.parse.ParseFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-04-25 17:58:09,561 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-04-25 17:58:09,582 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-04-25 17:58:09,582 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2013-04-25 17:58:09,582 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2013-04-25 17:58:11,046 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2013-04-25 18:01:02,936 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-25 18:01:02,936 WARN mapred.LocalJobRunner - job_local_0001
java.lang.ArrayIndexOutOfBoundsException
2013-04-25 18:01:03,412 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
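As far as I can tell, I have not run out of disk space: the /tmp partition has 250G free, and the partition Cassandra runs on has 2.5T free. Is there a way to increase the log verbosity? I am also puzzled that the ArrayIndexOutOfBoundsException reports neither the index it tried to access nor a stack trace, nothing at all. The keyspace webpage exists, and I can access it with cassandra-cli. Here is the output of readdb -stats: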
~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
min score: 55.0
retry 0: 3576393
jobs: {db_stats-job_local_0001={jobID=job_local_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=1609, MAP_INPUT_RECORDS=3576393, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=858, MAP_OUTPUT_BYTES=189548829, COMMITTED_HEAP_BYTES=1521614848, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=1010, COMBINE_INPUT_RECORDS=14305902, REDUCE_INPUT_RECORDS=114, REDUCE_INPUT_GROUPS=114, COMBINE_OUTPUT_RECORDS=444, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=114, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=14305572}, FileSystemCounters={FILE_BYTES_READ=910481, FILE_BYTES_WRITTEN=1028473}, File Output Format Counters ={BYTES_WRITTEN=2421}}}}
max score: 1.0
TOTAL urls: 3576393
status 0 (null): 3576393
avg score: 1.0
WebTable statistics: done