lucene - In Solr, why is 'built' not stemmed to 'build' while 'building' is?

Question

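I am trying to figure out two things in this post:

Why "built" is not stemmed to "build" even though the field type definition specifies a stemmer, whereas "building" is stemmed to "build".
How to use Luke to inspect the index and see which words were stemmed and what they were stemmed to. I cannot see in Luke that "building" was stemmed to "build". I know Lucene is stemming it, because searching for "build" successfully retrieves the document containing "building".

This link was helpful, but it did not answer my question.

For reference, here is the relevant part of schema.xml.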

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

And the field definition is

<field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>

The data set consists of several documents: one has "building" in the features field, one has "built" in the same field, and one has "Built-in" in the features field:

File hd.xml:

<field name="features">building NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>

File ipod_video.xml:

<field name="features">Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication</field>

File sd500.xml:

 <field name="features">built in flash, red-eye reduction</field>

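Using Lukeall-3.3.0, this is what I get when searching for "features:build". Note that I get 1 document rather than the expected 3. Even in that document I cannot see the stemmed term; I only see the original word "building".

Again in Luke, searching for "features:built" returns two documents.

Selecting one of them shows the original "built", but not "build".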

score 2 · Accepted Answer


For this particular case, you can use StemmerOverrideFilter to adjust the stemmer's output.
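As background, the Porter stemmer works by stripping suffixes, so it reduces "building" to "build" but has no rule for an irregular form such as "built". A rough sketch of how an override could be wired into the text_en analyzers from the question follows. The dictionary file name stemdict_en.txt is an assumption made for illustration, not something from the original schema.

```xml
<!-- Sketch only: stemdict_en.txt is a hypothetical file name.
     The override filter sits before the stemmer; terms it rewrites are
     marked as keywords, so PorterStemFilterFactory leaves them alone. -->
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_en.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
```

The dictionary file would hold one tab-separated pair per line, in this case the word built, a tab, then build. The filter would presumably be added to both the index and the query analyzer, and existing documents would need to be reindexed before searches for "build" also match "built".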

Answered on 2011-08-18T02:55:34.483
