solr - What is indexing in Apache Solr?

Question

I can upload PDF files to Solr and search them. But what is indexing in Solr? When I upload a PDF file, how does it get indexed?

This is the code I use to upload the PDF file:

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

up.addFile(fileName);

up.setParam("literal.id", solrId);
up.setParam("literal.first_name", "apachesolr");
up.setParam("literal.last_name", "cookbook");
up.setParam("literal.age", "30");

up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

solrServer.request(up);

Below is my schema.xml:

    <field name="first_name" type="string" indexed="true" stored="true" required="true"/>

<field name="last_name" type="string" indexed="true" stored="true" required="true"/>
<field name="age" type="int" indexed="true" stored="true" required="true"/>

<field name="created_at" type="date" indexed="true" stored="true"/>
<field name="updated_at" type="date" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true"/>

When I search for anything contained in the PDF, the result looks like this:

  SolrDocument[{
last_modified=Fri Oct 17 08:17:38 IST 2003, 
author=Mark Roth, Eduardo Pelegri-Llopart, 
title=[JSP 2.0 Specification, Final Release], 
content_type=[application/pdf], 
keywords=JSP, 
age=30, 
last_name=cookbook, 
first_name=apachesolr, 
id=jsp-2_0-fr-spec.pdf
}]

How does it get the title, author, keywords, and so on?

score 4 · Accepted Answer

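You are misunderstanding the concept of a document in a search engine. A document is a set of named fields with corresponding values. In the normal case you set each field explicitly. To start, try the following code using SolrJ: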

CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
for(int i = 0; i < 1000; ++i) {
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("title", "My Favorite book");
  doc.addField("author", "Kevin");
  doc.addField("content", "Bla bla bla");
  solr.add(doc);
}
solr.commit();

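This code creates a new SolrInputDocument and adds 3 fields to it: "title", "author" and "content" (note: all of these fields should be defined in schema.xml, so that Solr knows how to index and store them). It then adds the new document to the pending update (solr.add(doc)) and finally commits the changes. This is the basic way of using Solr.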
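
To make "how to index and store these fields" a little more concrete, here is a toy sketch in plain Java of what indexing a document of named fields roughly involves: each field value is split into terms, and each term is recorded in a lookup table (an inverted index) that points back to the documents containing it. ToyIndex and all of its methods are hypothetical names invented for this illustration. It is a simplified model under those assumptions, not Solr's or Lucene's actual implementation, which applies configurable analysis per field type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: a hypothetical class, not Solr's or Lucene's real implementation.
class ToyIndex {
    // "field:term" -> ids of the documents that contain that term in that field
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // Stored field values, kept so that a hit can be shown back to the user
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Adds a document (a set of named fields) and returns its internal id.
    int add(Map<String, String> doc) {
        int id = stored.size();
        stored.add(new LinkedHashMap<>(doc));
        for (Map.Entry<String, String> field : doc.entrySet()) {
            // Naive analysis: lowercase, then split on anything that is not a letter or digit.
            String[] terms = field.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    String key = field.getKey() + ":" + term;
                    postings.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
                }
            }
        }
        return id;
    }

    // Looks up one term in one field; no document text is scanned at query time.
    Set<Integer> search(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySet()
                : Collections.unmodifiableSet(hits);
    }

    // Returns the stored fields of a document, as a search result would.
    Map<String, String> get(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "My Favorite book");
        doc.put("author", "Kevin");
        doc.put("content", "Bla bla bla");
        index.add(doc);

        System.out.println(index.search("title", "favorite")); // prints [0]
        System.out.println(index.get(0).get("author"));        // prints Kevin
    }
}
```

With a table like this, answering a query for a term is a lookup rather than a scan through the text of every document, which is the main reason a search engine indexes content up front.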

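In this normal flow you extract the text from the document yourself. You can use Tika for that purpose, for example. This is the most flexible and fine-grained way.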

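What you are trying to do is use a newer Solr feature, content extraction. If I understand correctly, you are trying to set fields through setParam(). setParam() only sets request parameters, which are then converted into URL parameters that tell Solr how to process the request; it does not build the document field by field the way SolrInputDocument.addField() does. The literal.<fieldname> parameters are the one case where the handler turns a parameter into a field value, which is why age, first_name and last_name appear in your result.

Everything else comes from the /update/extract handler itself: it extracts the content according to the file's MIME type, looks for hints about the document's properties (its metadata) and uses them as fields (note that Solr uses the Tika library to extract document content). That is where title, author and keywords come from. So if you really want to use the /update/extract handler, start from a known working example, leave the lines that set request parameters unchanged, and check which fields come back.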
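
To illustrate the point about request parameters, here is a minimal sketch in plain Java (no SolrJ) of how parameters like the ones in the question could end up in the request URL. The class ExtractRequestUrl and its build and encode methods are hypothetical names made up for this illustration, and commit=true is only a stand-in for whatever setAction(...) adds. The real encoding happens inside SolrJ and may differ in detail between versions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; SolrJ does this internally.
class ExtractRequestUrl {
    // Appends the parameters to the handler path as a URL query string.
    static String build(String handlerPath, Map<String, String> params) {
        StringBuilder url = new StringBuilder(handlerPath);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // URL-encodes a single name or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "jsp-2_0-fr-spec.pdf");
        params.put("literal.first_name", "apachesolr");
        params.put("literal.last_name", "cookbook");
        params.put("literal.age", "30");
        params.put("commit", "true"); // assumption: simplified stand-in for setAction(...)
        System.out.println(build("/update/extract", params));
    }
}
```

The file itself travels separately as the request body, so nothing in these parameters describes the PDF's content; the handler has to work that out from the file.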
