java - Starting Solr indexing from within Nutch source code

Question

I am trying to index my Nutch crawl into Solr, but from within source code rather than from the command line.

I created the following function:

public static int runInjectSolr(String[] args, Properties prop) throws Exception{       
    String solrUrl = "http://ec2-X-X-X-X.compute-1.amazonaws.com/solr/collection1";

    String crawldb = JobBase.getParam(args,"crawldb", null, true);
    String segments = JobBase.getParam(args,"segments", null, true);
    String args2[] = {crawldb, segments};

    Configuration conf = new Configuration();
    conf.set("-D solr.server.url",solrUrl);
    int code = ToolRunner.run(NutchConfiguration.create(),
            new IndexingJob(conf), args2);
    return code;
}

But I get the following error:

2013-08-07 19:37:13,338 ERROR org.apache.nutch.indexwriter.solr.SolrIndexWriter (main): Missing SOLR URL. Should be set via -D solr.server.url 
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication

So I assume I am not creating my configuration correctly. Any suggestions?

Or should I pass my configuration to run in a different way? Perhaps without using

NutchConfiguration.create()

score 1 · Accepted Answer

There are two problems in your code:

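solr.server.url must be set directly on the configuration object, without the -D option. The message Nutch prints assumes you are running from the command line, which is misleading here.
As you mentioned, you are passing two different configuration instances. NutchConfiguration.create() creates a Hadoop configuration internally and adds some Nutch-specific resources to it, so you do not need to create one yourself. In addition, ToolRunner passes the conf object to IndexingJob, so you do not need to pass it through the constructor.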
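
To see why the "-D " prefix breaks the lookup, here is a minimal sketch. It uses java.util.Properties as a stand-in for the Hadoop Configuration, on the assumption that both store values under the exact key string they are given. The class name SolrUrlKeyDemo, the lookup helper and the localhost URL are made up for illustration and are not part of Nutch.

```java
import java.util.Properties;

// Illustration only: Properties stands in for Hadoop's Configuration.
public class SolrUrlKeyDemo {
    // The key the index writer asks for, per the error message in the question.
    static final String KEY = "solr.server.url";

    // Stores url under keyUsedInSet, then reads it back under KEY.
    static String lookup(String keyUsedInSet, String url) {
        Properties conf = new Properties();
        conf.setProperty(keyUsedInSet, url);
        return conf.getProperty(KEY);
    }

    public static void main(String[] args) {
        String url = "http://localhost:8983/solr/collection1"; // placeholder URL

        // "-D " becomes part of the key, so the reader finds nothing.
        System.out.println(lookup("-D solr.server.url", url));

        // With the plain key the value is found.
        System.out.println(lookup("solr.server.url", url));
    }
}
```

The first call yields null because the entry is stored under a different key than the one being read. The -D syntax only has meaning when a command-line parser handles it, which is what happens when the job is started from the shell.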

So the correct code is:

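// Let Nutch build the Hadoop configuration (it adds nutch-default.xml and nutch-site.xml)
Configuration conf = NutchConfiguration.create();
// Plain property key, without the "-D" command-line prefix
conf.set("solr.server.url", solrUrl);
// ToolRunner hands conf to the tool via setConf(), so IndexingJob needs no constructor argument
ToolRunner.run(conf, new IndexingJob(), args2);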
