solr - How does SOLR Cell add document content?

Question

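SOLR has a module called Cell. It uses Tika to extract content from documents and indexes it with SOLR.

From the source at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction, I concluded that Cell puts the raw extracted document text into a field called "content". That field is indexed by SOLR but not stored. When you query for documents, "content" does not show up.

My SOLR instance is schemaless (I left the default schema in place).

I'm trying to get similar behavior with the default UpdateRequestHandler (POST to /solr/corename/update). The POST request looks like this: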

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

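After adding a document this way, the content field is both indexed and stored. It shows up in query results. I don't want that; it's a waste of space.

What am I missing about the way Cell adds documents?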

score 2 · Accepted Answer

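If you don't want the field to store its content, you have to set the field to stored="false".

Since you're using schemaless mode (there is still a schema; it is just generated dynamically as new fields are added), you have to use the Schema API to change the field.

You can do that by issuing a replace-field command: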

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
  "name":"content",
  "type":"text",
  "stored":false }
}' http://localhost:8983/solr/collection/schema

You can see the defined fields by issuing a request against /collection/schema/fields.
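For scripting the same two calls, here is a minimal Python sketch using only the standard library. It assumes the same host, port and collection name as the curl example; the helper names `replace_field_payload` and `schema_request` are hypothetical and only for illustration. The helpers build the requests without sending them, so nothing touches a server until you call `urllib.request.urlopen` yourself.

```python
import json
import urllib.request

# Assumed base URL, matching the curl example above.
SOLR_BASE = "http://localhost:8983/solr"


def replace_field_payload(name, field_type, stored):
    """Build the JSON body for a Schema API replace-field command."""
    return json.dumps(
        {"replace-field": {"name": name, "type": field_type, "stored": stored}}
    )


def schema_request(collection, payload=None, path="schema"):
    """Build (but do not send) a request against the collection's schema endpoint.

    Without a payload this is a GET, e.g. for path="schema/fields".
    With a payload it is a JSON POST, e.g. for replace-field.
    """
    url = f"{SOLR_BASE}/{collection}/{path}"
    if payload is None:
        return urllib.request.Request(url)
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


# Usage against a running Solr (not executed here):
#   body = replace_field_payload("content", "text", False)
#   urllib.request.urlopen(schema_request("collection", body))
#   urllib.request.urlopen(schema_request("collection", path="schema/fields"))
```

The field type must name a type that exists in your schema; "text" is simply the value taken from the curl example above.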

score 1 · Accepted Answer

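The Cell code does add the content to the document as content, but there is a built-in field mapping rule that replaces content with _text_. In schemaless SOLR, _text_ is marked as not stored.

The rule is applied by the following line in SolrContentHandler.addField():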

String name = findMappedName(fname);

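In the params object, there is a rule, fmap.content, which says that content should be treated as _text_. It comes from corename\conf\solrconfig.xml, which by default contains the following fragment: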

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str> <!-- This one! -->
  </lst>
</requestHandler>

Meanwhile, corename\conf\managed_schema contains this line:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

That's the whole story.
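The mapping step described above can be illustrated with a small Python sketch. This is a simplified model, not the actual Solr code: the function name `map_field_name` is hypothetical, the defaults are copied from the solrconfig.xml fragment above, and the lowernames handling (lowercase, with non-alphanumeric characters turned into underscores) is an approximation of what the handler does.

```python
import re

# Defaults copied from the solrconfig.xml fragment above.
DEFAULTS = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}


def map_field_name(fname, params=DEFAULTS):
    """Simplified model of the field-name mapping (not the real Solr code)."""
    if params.get("lowernames") == "true":
        # Approximation: lowercase, then replace anything outside a-z/0-9 with "_".
        fname = re.sub(r"[^a-z0-9]", "_", fname.lower())
    # An fmap.<source> entry renames the field; otherwise the name is kept.
    return params.get("fmap." + fname, fname)


print(map_field_name("content"))       # _text_
print(map_field_name("Content-Type"))  # content_type

# Overriding the rule keeps the extracted text in "content" instead.
print(map_field_name("content", {"lowernames": "true", "fmap.content": "content"}))
```

In this model the extracted text only lands in the non-stored _text_ field because of the fmap.content default. A plain POST to /update does not go through this mapping, so there "content" is created as an ordinary field.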
