When I try to run the org.apache.nutch.crawl.Crawler class using an Eclipse launcher, I get the following exception. I have no idea what is causing it.

java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
13/07/30 21:14:26 INFO mapred.JobClient:  map 100% reduce 0%
13/07/30 21:14:26 INFO mapred.JobClient: Job complete: job_local_0002
13/07/30 21:14:26 INFO mapred.JobClient: Counters: 12
13/07/30 21:14:26 INFO mapred.JobClient:   FileSystemCounters
13/07/30 21:14:26 INFO mapred.JobClient:     FILE_BYTES_READ=47606
13/07/30 21:14:26 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=97164
13/07/30 21:14:26 INFO mapred.JobClient:   Map-Reduce Framework
13/07/30 21:14:26 INFO mapred.JobClient:     Reduce input groups=0
13/07/30 21:14:26 INFO mapred.JobClient:     Combine output records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Map input records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/07/30 21:14:26 INFO mapred.JobClient:     Reduce output records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Spilled Records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Map output bytes=0
13/07/30 21:14:26 INFO mapred.JobClient:     Combine input records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Map output records=0
13/07/30 21:14:26 INFO mapred.JobClient:     Reduce input records=0
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=null
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
    at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
    at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

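After some googling I found this (the class mentioned above is deprecated in Nutch 2.x, and the $NutchHome/src/bin/crawl script should be used instead). I also tried running the crawl script from a Cygwin terminal, but had no luck. Screenshot of the error from the terminal:

[image: screenshot of the terminal error]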

1 Answer

You should copy the file $NutchHome/src/bin/crawl to the deploy directory, $NutchHome/runtime/deploy/bin, and then run the crawl script:

crawl <seedDir> <crawlId> <numberOfRounds>
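
The copy step above could be scripted roughly as follows. This is only a sketch: the function name `install_crawl_script` is made up for illustration, and the seed directory, crawl id, and round count in the commented usage are placeholders rather than values from the question. The argument list may also differ between Nutch releases, so the usage message printed by the script itself should take precedence.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): copies src/bin/crawl into
# runtime/deploy/bin under the given Nutch home and marks it executable.
install_crawl_script() {
  nutch_home="$1"
  src="$nutch_home/src/bin/crawl"
  dest_dir="$nutch_home/runtime/deploy/bin"

  # Fail early if the source script is not where the answer says it is.
  if [ ! -f "$src" ]; then
    echo "crawl script not found at $src" >&2
    return 1
  fi

  mkdir -p "$dest_dir"
  cp "$src" "$dest_dir/crawl"
  chmod +x "$dest_dir/crawl"
}

# Example usage (placeholder values, assuming NUTCH_HOME is set):
#   install_crawl_script "$NUTCH_HOME"
#   cd "$NUTCH_HOME/runtime/deploy"
#   bin/crawl urls myCrawl 2
```

The function returns a non-zero status when the source script is missing, so a wrong $NutchHome shows up before anything is copied.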

Hope this helps.

Answered 2013-08-26T18:05:39.900