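I am using the Solr 4.6.1 spellcheck component to provide spelling suggestions. I have configured it to use DirectSolrSpellChecker with the default distance function and comparator, which, as I understand it, means that suggestions are sorted by edit distance (primary key) and then by document frequency (secondary key).
However, for the word paper, the top suggestion is papier, which has a much lower document frequency than paper. Both options are at an edit distance of 1 from paper.
Is this a bug, or a quirk of the edit distance algorithm that I don't understand?
My spellcheck configuration: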
<!-- a spellchecker built from a field of the main index -->
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spellfield</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<!-- the spellcheck distance measure used, the default is the internal levenshtein -->
<str name="distanceMeasure">internal</str>
<!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
<float name="accuracy">0.5</float>
<!-- Sort results by score first, then by frequency -->
<str name="comparatorClass">score</str>
<!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
<int name="maxEdits">2</int>
<!-- the minimum shared prefix when enumerating terms -->
<int name="minPrefix">0</int>
<!-- maximum number of inspections per result. -->
<int name="maxInspections">5</int>
<!-- minimum length of a query term to be considered for correction -->
<int name="minQueryLength">3</int>
<!-- maximum threshold of documents a query term can appear to be considered for correction -->
<float name="maxQueryFrequency">0.01</float>
<!-- require suggestions to occur in at least 2 documents (values of 1 or more are absolute document counts) -->
<float name="thresholdTokenFrequency">2</float>
</lst>
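
To make my expectation concrete, here is a minimal sketch of the ordering I assumed I would get: edit distance as the primary key, document frequency as the tie-breaker. It is not Solr's actual code. The function names, the misspelled query `paiper`, and the document frequencies are all hypothetical, and I am assuming a distance that counts an adjacent transposition as a single edit.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertion, deletion, substitution
    and adjacent transposition each cost one edit (an assumption here)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "ip" <-> "pi"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def rank(query: str, doc_freqs: dict) -> list:
    """Order candidates by distance (ascending), then frequency (descending)."""
    return sorted(doc_freqs, key=lambda t: (osa_distance(query, t), -doc_freqs[t]))


# Hypothetical document frequencies, made up for illustration only.
doc_freqs = {"papier": 12, "paper": 950, "pager": 5000}

print(rank("paiper", doc_freqs))  # ['paper', 'papier', 'pager']
```

Under this model, two candidates at the same distance are separated only by frequency, so the more frequent term should come first. That is the opposite of what I observe.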