java - How do I index an entire local hard drive into Apache Solr?

Question

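Is there a good way for Solr, or for a client library that feeds Solr, to index an entire hard drive? This should include the contents of zip files, including zip files nested recursively inside other zip files.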

It should run on Linux (no Windows-only clients).

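Of course, this involves a single scan of the whole file system starting from the root (or, really, from any folder). At this point I am not concerned with keeping the index up to date, only with creating it initially. The result would be similar to the old "Google Desktop" application that Google has discontinued.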

score 2 · Accepted Answer

You can use the SolrJ API to work with Solr.

Here is the API documentation: http://lucene.apache.org/solr/4_0_0/solr-solrj/index.html

Here is an article on how to use SolrJ to index files on a hard drive:
http://blog.cloudera.com/blog/2012/03/indexing-files-via-solr-and-java-mapreduce/

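Each file is represented by an InputDocument (SolrInputDocument in SolrJ), and you call .addField on it to attach the fields you want to search on later.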
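To make the "one document per file, one addField call per field" idea concrete, here is a minimal sketch that walks a directory tree with the standard library only and collects a field map for each regular file. The class name FileFieldCollector and the field names (id, name, size, modified) are assumptions chosen for illustration and do not come from the article. Sending the maps to Solr is left out; with SolrJ on the classpath, each entry would presumably be copied into a SolrInputDocument via addField(key, value).

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: gathers the fields one might later pass to addField.
class FileFieldCollector {

    /** Walks root recursively and returns one field map per regular file. */
    static List<Map<String, Object>> collect(Path root) throws IOException {
        List<Map<String, Object>> docs = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    // Field names are illustrative; a real Solr schema defines its own.
                    Map<String, Object> fields = new LinkedHashMap<>();
                    fields.put("id", file.toAbsolutePath().toString());
                    fields.put("name", file.getFileName().toString());
                    fields.put("size", attrs.size());
                    fields.put("modified", attrs.lastModifiedTime().toMillis());
                    docs.add(fields);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e) {
                // Skip entries that cannot be read (permissions, vanished files).
                return FileVisitResult.CONTINUE;
            }
        });
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map<String, Object> fields : collect(root)) {
            System.out.println(fields);
        }
    }
}
```

For a whole drive, holding every map in memory is unlikely to scale; a real indexer would probably send documents to Solr in batches from inside visitFile instead of returning a list.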

Here is the sample code for the index driver:

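import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hadoop driver for a map-only indexing job. IndexMapper is not shown here;
// it is defined in the linked article.
public class IndexDriver extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    //TODO: Add some checks here to validate the input path
    int exitCode = ToolRunner.run(new Configuration(),
     new IndexDriver(), args);
    System.exit(exitCode);
  }

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), IndexDriver.class);
    conf.setJobName("Index Builder - Adam S @ Cloudera");
    // Avoid duplicate task attempts, which would run the mapper twice on the same input
    conf.setSpeculativeExecution(false);

    // Set Input and Output paths (args[0] = input, args[1] = output)
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Use TextInputFormat
    conf.setInputFormat(TextInputFormat.class);

    // Mapper has no output, and no reducers are needed
    conf.setMapperClass(IndexMapper.class);
    conf.setMapOutputKeyClass(NullWritable.class);
    conf.setMapOutputValueClass(NullWritable.class);
    conf.setNumReduceTasks(0);
    JobClient.runJob(conf);
    return 0;
  }
}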

Read the article for more information.

Compressed files

For information on handling compressed files, see: "Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats".

Solr appears to have a bug that keeps it from handling zip files; here is the bug report, which includes a fix: https://issues.apache.org/jira/browse/SOLR-2416
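Because the question asks specifically about zip files nested inside zip files, a client-side alternative to server-side extraction may be worth considering. The sketch below is not what the article or Solr Cell does; it simply uses java.util.zip to list every entry in an archive, descending into entries whose names end in .zip. The class name NestedZipLister and the "!/" path separator are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: enumerates entries of a zip, recursing into nested zips.
class NestedZipLister {

    /**
     * Lists all non-directory entries reachable from the given zip stream.
     * Nested archives are identified by a ".zip" suffix only.
     */
    static List<String> list(InputStream in, String prefix) throws IOException {
        List<String> names = new ArrayList<>();
        // Deliberately not closed here: closing would also close the
        // enclosing stream, which the caller (or outer archive) still needs.
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + "!/" + entry.getName();
            if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                // The current entry's bytes are themselves a zip archive.
                names.addAll(list(zin, path));
            } else {
                names.add(path);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NestedZipLister <file.zip>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : list(in, args[0])) {
                System.out.println(name);
            }
        }
    }
}
```

In a real indexer, the branch that records the path is where the entry's bytes could be read from zin and turned into a document. The sketch has no depth or size limit, so untrusted archives (zip bombs) would need extra guards.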
