solr - Nutch 1.3 和 Solr 4.4.0 集成作业失败

Question

我正在尝试使用 nutch 爬网，并按照 nutch 官方网站中的文档步骤进行操作（成功运行爬网，将 scheme-solr4.xml 复制到 solr 目录中）。但是当我运行

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

我收到以下错误：

Indexer: starting at 2013-08-25 09:17:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)

我不得不提到 solr 正在运行，但我无法浏览http://localhost:8983/solr/admin（它会将我重定向到http://localhost:8983/solr/#）。

另一方面，当我停止 solr 时，我得到了同样的错误！有人知道我的设置有什么问题吗？

PS我抓取的网址是：http://localhost/NORC

score 0 · Accepted Answer

检查您的配置：Solr和Nutch

Nutch 和 Solr 的架构文件应该相同，否则您可能会遇到问题，因此请确保它们匹配

score 0 · Accepted Answer

当我在 nutch 中遇到同样的问题时，solr 的日志会出现错误消息“未知字段主机”。在 solr 中修改 schema.xml 后，nutch 的错误消失了。

score 0 · Accepted Answer

您在命令中缺少核心的名称。

例如：

./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/your_corname urls/ crawl 1

solr - Nutch 1.3 和 Solr 4.4.0 集成作业失败

3 回答 3

Related

Reference