solr - How to boost repeated values in a multiValued field on Solr

Question

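My Solr index has some repeated (identical string) values in a multiValued field, and I want to boost documents by the number of matches in that field. For example: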

doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }

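When I run the query q=locales:en_US, I want to see doc1 at the top because it has two "en_US" values. What is the right way to boost on this kind of data?

Should I use a special tokenizer?

Solr version: 4.5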

score 4 · Accepted Answer

Disclaimer

To use any of the following solutions, you need to make one of these two changes:

Create a copyField for locales:

<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>

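Or change the type of locales to "text_general" (this type is provided in the standard Solr collection).

First solution (ordering):

Results can be sorted by a function, so we can sort by the number of occurrences of the term in the field (the termfreq function):

If the copyField is used, the sort parameter is: termfreq(locales_text,'en_US') DESC
If locales is of type text_general, the sort parameter is: termfreq(locales,'en_US') DESC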
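
The sort expression above can be sent as an ordinary request parameter. The following is a minimal sketch of building such a request, not a definitive client: the endpoint URL and the helper name build_sort_params are assumptions for illustration, and only the q, fl, sort and wt values come from the answer.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint: a local Solr 4.x instance with a core named "collection1".
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_sort_params(freq_field, term):
    """Build select parameters that order hits by termfreq(freq_field, term).

    Hypothetical helper: freq_field is the analyzed field (locales_text when
    the copyField is used, locales when its type is text_general).
    """
    return {
        "q": "locales:{0}".format(term),
        "fl": "*,score",
        "sort": "termfreq({0},'{1}') DESC".format(freq_field, term),
        "wt": "xml",
    }


params = build_sort_params("locales_text", "en_US")
query_string = urlencode(params)  # percent-encodes parentheses, quotes and the space
request_url = SOLR_SELECT_URL + "?" + query_string
print(request_url)
```

Encoding the parameters matters here because the sort value contains quotes, parentheses and a space, which are not safe to paste into a URL unescaped.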

Sample response for the copyField option (the results are the same for the text_general type):

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="fl">*,score</str>
    <str name="sort">termfreq(locales_text,'en_US') DESC</str>
    <str name="indent">true</str>
    <str name="q">locales:en_US</str>
    <str name="_">1383598933337</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">0.4203996</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

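You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.

One thing to keep in mind: this is an ordering function, not a boosting function. If you would rather raise the score based on multiple matches, you will probably be more interested in the second solution.

I included the score in the results to demonstrate what @arun said. You can see that the scores differ (probably because of field length)... Quite unexpectedly (to me), for a multiValued string field they are the same.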

Second solution (boosting):

If the copyField is used, the query is: {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of type text_general, the query is: {!boost b=termfreq(locales,'en_US')}locales:en_US

Sample response for the copyField option (the results are the same for the text_general type):

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="lowercaseOperators">true</str>
    <str name="fl">*,score</str>
    <str name="indent">true</str>
    <str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
    <str name="_">1383599910386</str>
    <str name="stopwords">true</str>
    <str name="wt">xml</str>
    <str name="defType">edismax</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
  <doc>
    <arr name="locales">
      <str>en_US</str>
      <str>de_DE</str>
      <str>fr_FR</str>
      <str>en_US</str>
    </arr>
    <str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
    <long name="_version_">1450808563062538240</long>
    <float name="score">1.1890696</float></doc>
  <doc>
    <arr name="locales">
      <str>en_US</str>
    </arr>
    <str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
    <long name="_version_">1450808391856291840</long>
    <float name="score">0.5945348</float></doc>
</result>
</response>

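You can see that the score changed significantly. The first document scores twice as much as the second (it has two matches, so its base score of 0.5945348 is multiplied by 2).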
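
The arithmetic behind the boosted scores can be sketched as follows. This is an illustration, not Solr code: the helper names are hypothetical, and it assumes that both documents have the same base score of 0.5945348 on the locales query, as the sample response suggests. The {!boost} parser multiplies the score of the wrapped query by the value of the b function.

```python
def boost_query(freq_field, term, main_query):
    """Wrap main_query in a {!boost} local-params prefix (hypothetical helper)."""
    return "{{!boost b=termfreq({0},'{1}')}}{2}".format(freq_field, term, main_query)


def boosted_score(base_score, term_freq):
    """Multiplicative boost: base score times the termfreq value."""
    return base_score * term_freq


BASE_SCORE = 0.5945348  # assumed base score of both documents, taken from the sample response

query = boost_query("locales_text", "en_US", "locales:en_US")
doc1_score = boosted_score(BASE_SCORE, 2)  # doc1 holds en_US twice
doc2_score = boosted_score(BASE_SCORE, 1)  # doc2 holds en_US once

print(query)
print(doc1_score, doc2_score)
```

Because the boost is multiplicative, a document whose termfreq is 0 in the boosted field would end up with a score of 0, which is worth keeping in mind if the main query can match documents that do not contain the term in that field.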

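Third solution (omitNorms=true)

Based on @arun's answer, I figured there is also a third option.

If you convert the field to (for example) text_general and set omitNorms=true for that field, it should have the same result.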

score 0 · Accepted Answer

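The default standard request handler in Solr does not use term frequency alone to compute the score. Besides term frequency, it also uses the length of the field. See the Lucene scoring algorithm, which says: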

lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.

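Since doc2 has the shorter field, it may score higher. Check the scores of the results by adding fl=*,score to the query. To understand how Solr arrived at a score, use fl=*,score&wt=xml&debugQuery=on (then right-click in the browser and view the page source to see the score calculation properly indented). I believe you will see that lengthNorm causes doc1 to score lower.

To keep field length from contributing to the score, you need to disable it: set omitNorms=true for that field. (Reference: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.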
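
As a rough check of this explanation, the scores 0.4203996 and 0.5945348 from the first sample response can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(number of tokens)), an index holding only the two documents, and a single-term query so that the query normalization cancels out. It also ignores the precision lost when norms are stored in the index. The function name is hypothetical.

```python
import math


def classic_score(term_freq, num_tokens, num_docs, doc_freq, omit_norms=False):
    """Single-term score under the assumed classic TF-IDF formulas."""
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    # With omitNorms=true the length norm is not applied, so it counts as 1.
    length_norm = 1.0 if omit_norms else 1.0 / math.sqrt(num_tokens)
    return tf * idf * length_norm


# doc1: en_US appears twice among 4 tokens; doc2: once among 1 token.
doc1_with_norms = classic_score(2, 4, num_docs=2, doc_freq=2)
doc2_with_norms = classic_score(1, 1, num_docs=2, doc_freq=2)

# Same documents with length normalization disabled.
doc1_no_norms = classic_score(2, 4, num_docs=2, doc_freq=2, omit_norms=True)
doc2_no_norms = classic_score(1, 1, num_docs=2, doc_freq=2, omit_norms=True)

print(doc1_with_norms, doc2_with_norms)
print(doc1_no_norms, doc2_no_norms)
```

Under these assumptions, the length norm of 0.5 for the four-token field outweighs the sqrt(2) gain from the second occurrence, so doc1 falls below doc2. With norms omitted only the sqrt(2) factor remains and doc1 ranks first.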
