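I'm using Nutch 1.6 to crawl some forums and Solr 1.6.2 to index them. I ran a test query against Solr and was surprised to get only a few results. I suspected a problem with either Nutch's parsing of the pages or Solr's indexing. After poking around, I found that Nutch is not parsing many of the pages it fetches: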
bin/nutch readseg -list -dir crawl-mothering2/segments/
NAME            GENERATED  FETCHED  PARSED
20130228001531  23         27       9
20130228003940  1430       1434     661
20130228001829  202        206      105
20130228061337  1068       1090     475
20130228091009  1          2        0
20130228085956  34         34       25
20130228090348  44         45       34
20130228090851  7          7        6
20130228080438  364        374      192
20130228030933  1774       1795     903
20130228084205  168        169      63
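To put a number on the gap, here is a minimal sketch that totals the FETCHED and PARSED columns. It assumes the four-column layout shown above (NAME GENERATED FETCHED PARSED); `parse_ratio` is a hypothetical helper name, and if your `readseg -list` output carries extra columns the field numbers would need adjusting.

```shell
# Hypothetical helper: summarize parsed vs. fetched counts from a segment
# listing read on stdin. Assumes exactly four whitespace-separated columns:
# NAME GENERATED FETCHED PARSED.
parse_ratio() {
  awk '
    # Skip the header and any line whose count columns are not plain integers.
    NF == 4 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
      fetched += $3
      parsed  += $4
    }
    END {
      if (fetched > 0)
        printf "%d/%d parsed (%.1f%%)\n", parsed, fetched, 100 * parsed / fetched
    }
  '
}

# Usage (commented out so the snippet runs without a Nutch install):
# bin/nutch readseg -list -dir crawl-mothering2/segments/ | parse_ratio
```

Fed the listing above, this comes to 2473/5183 parsed (47.7%), so roughly half of the fetched pages have no parse output.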
But when I try to parse these segments, I get this:
bin/nutch parse crawl-mothering2/segments/*
ParseSegment: starting at 2013-03-21 00:20:43
ParseSegment: segment: crawl-mothering2/segments/20130228001531
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
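What gives? How can the segments count as already parsed when roughly half of the fetched pages were never parsed?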