I am trying to index Nutch crawl data with Bluemix Solr. I ran the following command at the command prompt:
bin/nutch index -D solr.server.url="https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters/CLUSTER-ID/solr/admin/collections" -D solr.auth=true -D solr.auth.username="USERNAME" -D solr.auth.password="PASS" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
But it fails to complete the indexing. The output is as follows:
Indexer: starting at 2016-06-16 16:31:50
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 153 documents
Indexing 153 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
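I suspect it has something to do with the solr.server.url address, perhaps with how it ends. I have tried changing it in several ways, for example to the form that Bluemix Solr uses for indexing JSON/CSV/... files, but without success so far.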
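Does anyone know how I can solve this? If the problem is what I suspect, does anyone know exactly what solr.server.url should be? By the way, "example_collection" is my collection name, and I am using Nutch 1.11.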