solr - Indexing log files with Solr

Question

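I have to index log files that live in a recursive directory structure (each directory can contain one or more files and directories). The log files have a variety of extensions. Searches will run against the text of the log files. Every file that contains a given string (the search keyword) should come back as a search result, together with its name and full path.

I tried DIH with Tika for this, but it seems to handle only a single file. I also tried FileListEntityProcessor but could not get it to work.

How can I index these log files with Solr? If anyone has done this, please help.

Thanks in advance.

PS: Individual log files are not very large, but the data as a whole is.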

score 1 · Accepted Answer

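I would do something like this:

Have a system that generates the input directories matching your search.
Have a function that parses the matching logs in those directories, or parts of them, into Solr documents in RAM.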
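The two steps above could be sketched roughly as follows. This is only an illustration using the JDK, not code from the answer: the class name `LogFileDocs`, the method names, and the field names `path`, `name`, and `content` are all assumptions, and a plain `Map` stands in for a `SolrInputDocument` so the sketch compiles without SolrJ on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; names and fields are illustrative assumptions.
class LogFileDocs {
    /** Lists every regular file under root, descending into subdirectories. */
    static List<Path> listLogFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Turns one log file into a flat field map (stand-in for a SolrInputDocument). */
    static Map<String, String> toDocument(Path file) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("path", file.toAbsolutePath().toString());
        doc.put("name", file.getFileName().toString());
        // Assumes UTF-8 text and files small enough to read in one go,
        // which matches the "individual files are not very large" note.
        doc.put("content", new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        return doc;
    }
}
```

With SolrJ available, `toDocument` would presumably build a `SolrInputDocument` and call `addField` for each of these values instead of filling a map.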

Stream the documents from a directory or a set of directories to Solr through an iterator:

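import java.util.Iterator;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// HttpSolrServer needs the base URL of the core; this value is a placeholder.
String solrUrl = "http://localhost:8983/solr/collection1";
HttpSolrServer server = new HttpSolrServer(solrUrl);
Iterator<SolrInputDocument> iter = new Iterator<SolrInputDocument>() {
  public boolean hasNext() {
      boolean result = false;
      // set result to true while there are more documents to send
      return result;
  }

  public SolrInputDocument next() {
      SolrInputDocument result = null;
      // construct a new document here and assign it to result
      return result;
  }

  public void remove() {
      // required before Java 8; streaming to Solr never removes elements
      throw new UnsupportedOperationException();
  }
};
server.add(iter);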

See this approach and others here: http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
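One way to fill in the `hasNext` and `next` stubs is to wrap an iterator over file paths and build each document only when it is requested, so the whole data set never has to sit in memory at once. The sketch below is an assumption about how that could look, not part of the answer: `LazyDocIterator` is a hypothetical name, and the generic types keep it free of any SolrJ dependency.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Wraps a source iterator and converts each element only when next() is called. */
class LazyDocIterator<S, D> implements Iterator<D> {
    private final Iterator<S> source;
    private final Function<S, D> toDoc;

    LazyDocIterator(Iterator<S> source, Function<S, D> toDoc) {
        this.source = source;
        this.toDoc = toDoc;
    }

    @Override
    public boolean hasNext() {
        // There is another document as long as there is another source item.
        return source.hasNext();
    }

    @Override
    public D next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        // The conversion (for example reading a file) happens here, on demand.
        return toDoc.apply(source.next());
    }
}
```

In the streaming setup above, `S` would presumably be a file path and `D` a `SolrInputDocument`, and the resulting iterator would be what gets passed to `server.add(...)`.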

score 1 · Accepted Answer

TikaEntityProcessor can be used together with FileListEntityProcessor.

data-config.xml

<dataConfig>
    <dataSource name="bin" type="BinFileDataSource"/>
    <document>
        <entity name="f" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor" transformer="TemplateTransformer"
            baseDir="L:/Documents/65923/"
            fileName=".*\.*" onError="skip" recursive="true">

            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastmodified" />

            <entity name="tika-test" dataSource="bin" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" onError="skip">
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="content"/>   
            </entity>
        </entity>
    </document>
</dataConfig>
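Once a configuration like this is loaded, the import is normally started and the index queried over HTTP. The sketch below only builds the two URLs. It assumes the DataImportHandler is registered at `/dataimport` in solrconfig.xml (a common convention, but it depends on your setup) and that the schema defines the `content` and `path` fields used in the config above. `SolrUrls` and its method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; the handler path and core URL are assumptions.
class SolrUrls {
    /** URL that asks the DataImportHandler to run a full import. */
    static String fullImportUrl(String coreUrl) {
        return coreUrl + "/dataimport?command=full-import";
    }

    /** URL that searches the "content" field as a phrase and returns only "path". */
    static String searchUrl(String coreUrl, String keyword)
            throws UnsupportedEncodingException {
        // Quote the keyword so multi-word input is treated as one phrase.
        // Keywords that themselves contain quotes would need extra escaping.
        String q = URLEncoder.encode("content:\"" + keyword + "\"", "UTF-8");
        return coreUrl + "/select?q=" + q + "&fl=path&wt=json";
    }
}
```

Requesting the first URL starts the recursive import from `baseDir`. The second returns the full path of every indexed file whose extracted text matches the keyword.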
