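I'm using the spellcheck component in Solr 4.6.1 to provide spelling suggestions. I've configured it to use DirectSolrSpellChecker with the default distance measure and comparator, which, as I understand it, means suggestions are sorted by edit distance (primary key) and then by document frequency (secondary key).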

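However, for the word paper, the top suggestion is papier, which has a much lower document frequency than paper. Both candidates are at an edit distance of 1 from paper.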
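To make the expectation concrete, here is a minimal Python sketch of the ordering I assumed: edit distance ascending, then document frequency descending. It is not Solr's or Lucene's actual code. It uses plain Levenshtein distance, and the query term `papeer` and the frequencies are hypothetical values chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank(query, candidates):
    """Sort (term, doc_freq) pairs by distance ascending, then frequency descending."""
    return sorted(candidates, key=lambda c: (levenshtein(query, c[0]), -c[1]))


# Hypothetical query term and document frequencies, for illustration only.
query = "papeer"
candidates = [("papier", 12), ("paper", 340)]
ranked = rank(query, candidates)
print(ranked)
```

Under this model both candidates tie on distance, so the more frequent term should come first, which is the opposite of what I observe.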

Is this a bug, or a quirk of the edit-distance algorithm that I don't understand?

My spellcheck configuration:

<!-- a spellchecker built from a field of the main index -->
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">spellfield</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
  <str name="distanceMeasure">internal</str>
  <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
  <float name="accuracy">0.5</float>
  <!-- Sort Results by frequency -->
  <str name="comparatorClass">score</str>
  <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
  <int name="maxEdits">2</int>
  <!-- the minimum shared prefix when enumerating terms -->
  <int name="minPrefix">0</int>
  <!-- maximum number of inspections per result. -->
  <int name="maxInspections">5</int>
  <!-- minimum length of a query term to be considered for correction -->
  <int name="minQueryLength">3</int>
  <!-- maximum threshold of documents a query term can appear to be considered for correction -->
  <float name="maxQueryFrequency">0.01</float>
  <!-- uncomment this to require suggestions to occur in 1% of the documents-->
  <float name="thresholdTokenFrequency">2</float>
</lst>
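To see the frequencies Solr reports for each suggestion, a request along these lines may help. This is only a sketch: the host, the core name `collection1`, and the `/spell` handler path are assumptions about the setup, not taken from my actual deployment. The `spellcheck.*` parameters are standard request parameters of the spellcheck component, and `spellcheck.extendedResults=true` asks Solr to include frequency information with the suggestions.

```python
from urllib.parse import urlencode

# Assumed values: host, core name and handler path are placeholders.
BASE_URL = "http://localhost:8983/solr/collection1/spell"

params = {
    "q": "paper",
    "spellcheck": "true",
    "spellcheck.dictionary": "default",    # matches <str name="name">default</str>
    "spellcheck.count": "10",              # ask for more than the single top suggestion
    "spellcheck.extendedResults": "true",  # include frequency details
    "wt": "json",
}

url = BASE_URL + "?" + urlencode(params)
print(url)
```

Comparing the reported frequencies for the competing suggestions should show whether the secondary sort key is being applied at all.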