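I've run into a Nutch problem that I can't seem to debug.
I started out using Nutch to crawl our pages and index them into Solr core 1. That works well, and the job completes as it should.
Now, though, I want to start indexing our pages into our Solr core 0, alongside the other items we want to index there.
Indexing itself isn't the problem: it crawls and indexes. But on core 0 it keeps failing on the deduplication task at the end of indexing, with the error shown below. As far as I can tell, the schema.xml and solrconfig.xml files are the same in core0 and core1, except that in core0 the url field is no longer required. The other indexed items don't have a URL, so the id field is the one standard required field for all of them. Could that be causing the problem? What is the deduplicator trying to do, and what is getting in its way? How can I get past this? Thanks! The log output is: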
2013-07-26 16:55:17,797 INFO solr.SolrIndexWriter - Indexing 157 documents
2013-07-26 16:55:30,407 INFO solr.SolrMappingReader - source: content dest: content
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: title dest: title
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: host dest: host
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: segment dest: segment
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: boost dest: boost
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: digest dest: digest
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: url dest: id
2013-07-26 16:55:30,444 INFO solr.SolrMappingReader - source: url dest: url
2013-07-26 16:55:31,590 INFO indexer.IndexingJob - Indexer: finished at 2013-07-26 16:55:31, elapsed: 00:00:19
2013-07-26 16:55:31,593 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2013-07-26 16:55:31
2013-07-26 16:55:31,593 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://<domain>:<port>/solr/core0/
2013-07-26 16:55:32,043 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-26 16:55:32,043 WARN mapred.LocalJobRunner - job_local1142877999_0055
java.lang.Exception: java.lang.NullPointerException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
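
One way to read the trace, offered as a sketch rather than a diagnosis: the NullPointerException is raised inside Text.set, which is what happens when it is handed a null string. That would fit a core in which some documents lack a field that the dedup job reads for every record. The snippet below is plain Java and does not call Nutch, Hadoop, or SolrJ. The names MissingFieldCheck, encode, and idsMissingField are hypothetical, and the field name digest is an assumption about which field might be missing on the non-crawled items in core0.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of Nutch, Hadoop, or SolrJ.
class MissingFieldCheck {

    // Mimics a Text-like key being filled from a string: a null input cannot be encoded.
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8); // NullPointerException when value is null
    }

    // Returns the ids of the documents that do not carry the given field.
    static List<String> idsMissingField(List<Map<String, Object>> docs, String field) {
        List<String> missing = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (doc.get(field) == null) {
                missing.add(String.valueOf(doc.get("id")));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A document shaped like one written by the crawler (field values are made up).
        Map<String, Object> crawled = new HashMap<>();
        crawled.put("id", "http://example.com/page");
        crawled.put("digest", "abc123");

        // A document shaped like one of the other indexed items: id only, no digest (assumed).
        Map<String, Object> other = new HashMap<>();
        other.put("id", "item-42");

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(crawled);
        docs.add(other);

        // Lists the documents that a per-record read of "digest" would trip over.
        System.out.println(idsMissingField(docs, "digest"));

        try {
            encode((String) other.get("digest"));
        } catch (NullPointerException e) {
            System.out.println("NullPointerException for " + other.get("id"));
        }
    }
}
```

If this reading holds, the same check could be run against core0 itself by querying for documents that have no value in whichever field the dedup job actually reads.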