lucene - JaroWinklerDistance in Lucene returns strange results

Question

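I have a file containing some phrases. Using Lucene's JaroWinklerDistance, I expect to get the phrases from that file that are most similar to my input.

Here is an example of my problem.

We have a file containing: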

//phrases.txt
this is goodd
this is good
this is god

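If my input is "this is good", it should first return "this is good" from the file, because the similarity score there is the highest possible (1). But for some reason it only returns "this is goodd" and "this is god"!

Here is my code: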

try {
    SpellChecker spellChecker = new SpellChecker(new RAMDirectory(), new JaroWinklerDistance());
    Dictionary dictionary = new PlainTextDictionary(new File("src/main/resources/words.txt").toPath());
    IndexWriterConfig iwc=new IndexWriterConfig(new ShingleAnalyzerWrapper());
    spellChecker.indexDictionary(dictionary,iwc,false);

    String wordForSuggestions = "this is good";

    int suggestionsNumber = 5;

    String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber,0.8f);
    if (suggestions!=null && suggestions.length>0) {
        for (String word : suggestions) {
            System.out.println("Did you mean:" + word);
        }
    }
    else {
        System.out.println("No suggestions found for word:"+wordForSuggestions);
    }
} catch (IOException e) {
    e.printStackTrace();
}

score 1 · Accepted Answer

suggestSimilar will not return a suggestion that is identical to the input. Quoting the source code:

// don't suggest a word for itself, that would be silly

If you want to know whether wordForSuggestions is in the dictionary, use the exist method:

if (spellChecker.exist(wordForSuggestions)) {
    //do what you want for an, apparently, correctly spelled word
}
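
To make the behavior concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It scores the three phrases from the question with a textbook Jaro-Winkler formula and skips the candidate that equals the input, the way the quoted comment describes. The class `SuggestSketch` and the methods `jaro`, `jaroWinkler`, and `suggest` are hypothetical names made up for illustration, and Lucene's own JaroWinklerDistance may produce slightly different scores than this textbook version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the spell checker; this is NOT Lucene code.
class SuggestSketch {

    // Textbook Jaro similarity in [0, 1].
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int lenA = a.length(), lenB = b.length();
        if (lenA == 0 || lenB == 0) return 0.0;
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(lenA, lenB) / 2 - 1);
        boolean[] matchedA = new boolean[lenA];
        boolean[] matchedB = new boolean[lenB];
        int matches = 0;
        for (int i = 0; i < lenA; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lenB - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < lenA; i++) {
            if (!matchedA[i]) continue;
            while (!matchedB[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / lenA + m / lenB + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Textbook Jaro-Winkler: boost for a shared prefix (max 4 chars, scaling 0.1).
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Mimics the behavior described above: the input itself is never suggested.
    static List<String> suggest(String input, List<String> dictionary, double minScore) {
        List<String> result = new ArrayList<>();
        for (String candidate : dictionary) {
            if (candidate.equals(input)) continue; // skip the exact match
            if (jaroWinkler(input, candidate) >= minScore) result.add(candidate);
        }
        // Highest score first.
        result.sort((x, y) -> Double.compare(jaroWinkler(input, y), jaroWinkler(input, x)));
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("this is goodd", "this is good", "this is god");
        String input = "this is good";
        // Check for the exact match separately, as the answer suggests with exist.
        if (dictionary.contains(input)) {
            System.out.println("Exact match in dictionary: " + input);
        }
        for (String s : suggest(input, dictionary, 0.8)) {
            System.out.println("Did you mean: " + s);
        }
    }
}
```

The exact match does score 1 here, but it never reaches the suggestion list because it is filtered out before scoring. Checking for it separately, as with exist above, is what brings it back.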
