solr - How to update Solr documents on the Solr server side using a custom handler/plugin

Question

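I have a core with millions of records.
I want to add a custom handler that scans the existing documents and updates one of the fields based on a condition (for example, age > 12).
I would prefer to do this on the Solr server side, to avoid sending millions of documents to the client and back.
I am thinking of writing a Solr plugin that receives a query and updates some fields on the matching documents (similar to how delete-by-query works).
I would like to know whether there is an existing solution or a better alternative.
I searched the web for a while and could not find an example of a Solr plugin that updates documents (I do not need to extend the update handler).
I wrote a plugin that uses the following code. It works, but it is not as fast as I need.
Currently I do: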

// Fragment from inside the plugin; docList, new_value and
// updateRequestProcessor are defined elsewhere.
AddUpdateCommand addUpdateCommand = new AddUpdateCommand(solrQueryRequest);
DocIterator iterator = docList.iterator();
SolrIndexSearcher indexReader = solrQueryRequest.getSearcher();
while (iterator.hasNext()) {
   // Load the stored fields of the next matching document.
   Document document = indexReader.doc(iterator.nextDoc());
   SolrInputDocument solrInputDocument = new SolrInputDocument();
   // Reuse the same command object for every document.
   addUpdateCommand.clear();
   addUpdateCommand.solrDoc = solrInputDocument;
   addUpdateCommand.solrDoc.setField("id", document.get("id"));
   addUpdateCommand.solrDoc.setField("my_updated_field", new_value);
   // Hand the document to the update processor chain.
   updateRequestProcessor.processAdd(addUpdateCommand);
}

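But this is very expensive, because the update handler fetches the document again even though I already have it at hand.
Is there a safe way to update the Lucene document and write it back while still going through all the Solr-related code, such as caches, extra Solr logic, and so on?
I am considering converting it to a SolrInputDocument and then adding the document through Solr, but I would first need to convert all the fields.
Thanks in advance, Avner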

score 0 · Accepted Answer

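I am not sure whether the following will improve performance, but I think it may help you.

Take a look at SolrEntityProcessor.

Its description sounds very relevant to what you are looking for.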

This EntityProcessor imports data from different Solr instances and cores. 
The data is retrieved based on a specified (filter) query. 
This EntityProcessor is useful in cases you want to copy your Solr index 
and slightly want to modify the data in the target index. 
In some cases Solr might be the only place were all data is available.

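However, I could not find an out-of-the-box way to embed your logic, so you may have to extend the following class.

SolrEntityProcessor and the link to its source code

You probably know these already, but here are a few more points.

1) Make the whole process use all available CPU cores by making it multithreaded.

2) Use the latest version of Solr.

3) Experiment with two Solr applications on different machines with minimal network latency. This will be a tough call: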

same machine, two processes VS two machines, more cores, but network overhead.

4) Tune the Solr caches in a way that suits your use case and your specific implementation.

5) More resources: Solr Performance Problems and SolrPerformanceFactors
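
Point 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the class ParallelBatchUpdater, the batch size, and the updateBatch callback are hypothetical names and not part of Solr. In a real setup the callback would send one batch of updates to Solr; here it only receives the document ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper, not a Solr class: splits ids into batches and
// processes the batches on a fixed-size thread pool.
final class ParallelBatchUpdater {

    /** Splits ids into consecutive batches of at most batchSize elements. */
    static List<List<String>> partition(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    /**
     * Runs updateBatch once per batch on a pool of the given size, waits for
     * all batches to finish, and returns the number of batches processed.
     */
    static int processInParallel(List<String> ids, int batchSize, int threads,
                                 Consumer<List<String>> updateBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<String> batch : partition(ids, batchSize)) {
                futures.add(pool.submit(() -> updateBatch.accept(batch)));
            }
            for (Future<?> future : futures) {
                try {
                    future.get(); // surfaces any failure from a worker thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return futures.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "3", "4", "5");
        // The callback is a placeholder for "send this batch to Solr".
        int batches = processInParallel(ids, 2, 4,
                batch -> System.out.println("would update " + batch));
        System.out.println(batches + " batches processed");
    }
}
```

Sizing the pool to the number of available cores (for example Runtime.getRuntime().availableProcessors()) is a reasonable starting point, but the best value depends on how much of the time is spent waiting on Solr, so it is worth measuring.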

Hope this helps. Regardless of this answer, please let me know your statistics. I am curious, and your information may help someone else later.

score 0 · Accepted Answer

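To point out where to put your custom logic, I suggest looking at SolrEntityProcessor in combination with Solr's ScriptTransformer.

The ScriptTransformer lets you process each entity after it has been extracted from the data import source, so you can manipulate it and add custom field values before the new entity is written to Solr.

A sample data-config.xml might look like this: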

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>

    <script>
    <![CDATA[
        function calculateValue(row)        {
            row.put("CALCULATED_FIELD", "The age is: " + row.get("age"));
            return row;
        }
    ]]>
    </script>

  <document>
    <entity name="sep" processor="SolrEntityProcessor" 
        url="http://localhost:8080/solr/your-core-name" 
        query="*:*"
        wt="javabin"
        transformer="script:calculateValue">
            <field column="ID" name="id" />
            <field column="AGE" name="age" />
            <field column="CALCULATED_FIELD" name="update_field" />
    </entity>
  </document>
</dataConfig>

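As you can see, you can perform any data transformation you like, as long as it can be expressed in JavaScript. So this would be a good place to express your logic and transformations.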

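You said there may be a constraint such as age > 12. I would handle this through the query attribute of the SolrEntityProcessor. You can write query=age:{12 TO *] so that only records with an age greater than 12 are read for the update (for integer ages, query=age:[13 TO *] selects the same records).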
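
To sanity-check the selection and transformation rule outside of Solr, the same logic can be mirrored in plain Java. This is a minimal sketch, not DIH code: the class AgeRule and its methods are hypothetical, rows are modelled as plain maps, and the field names are taken from the data-config.xml in this answer. The query builder assumes a Solr version that accepts mixed range brackets (an exclusive lower bound with an inclusive upper bound).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for testing the rule; not part of Solr or DIH.
final class AgeRule {

    /**
     * Builds a range query for "field > bound". The curly brace makes the
     * lower bound exclusive; "*" leaves the upper bound open.
     */
    static String greaterThanQuery(String field, int bound) {
        return field + ":{" + bound + " TO *]";
    }

    /**
     * Mirrors the calculateValue script: returns a copy of the row with
     * CALCULATED_FIELD derived from the "age" value.
     */
    static Map<String, Object> calculateValue(Map<String, Object> row) {
        Map<String, Object> out = new HashMap<>(row);
        out.put("CALCULATED_FIELD", "The age is: " + row.get("age"));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", "doc-1");
        row.put("age", 15);
        System.out.println(greaterThanQuery("age", 12));
        System.out.println(calculateValue(row));
    }
}
```

Keeping the rule in one small, testable place makes it easier to confirm that the query and the script agree on which documents are touched before running the import against millions of records.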
