java - How to find an analyzed term with a fuzzy (approximate) search in Lucene 3.x?

Question

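The query "laser~" doesn't find "laser".

I'm using Lucene's GermanAnalyzer to store documents in the index. I saved two documents with the "title" fields "laser" and "labor" respectively. Afterwards I run the fuzzy query laser~. Lucene only finds the document containing "labor". What is the Lucene 3.x way of implementing this kind of search?

From looking at the Lucene source code, I guess that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether that is the case.

Below is some background and a few remarks.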

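OpenCms

I noticed this behaviour after somebody recently pointed out that our OpenCms search was missing documents on the results page. The search failed on some German sites. After investigating a bit, I found:

- We use OpenCms 8.5.1 to perform our searches, which uses Lucene 3.6.1 to implement the search functionality.
- By default, OpenCms uses org.apache.lucene.analysis.de.GermanAnalyzer for sites with a German locale to parse both content and queries.
- We store the site content with Field.Index.ANALYZED.
- For the reported failed searches, we force a fuzzy search by appending a tilde to the search query.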

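Sample code

To narrow down the problem, I wrote some code that runs Lucene 3.6.1 directly (I also tested 3.6.2; both behave the same). Note that the API and the fuzzy search in Lucene 4+ are slightly different, and the problem does not occur there. (Unfortunately, I have no control over the Lucene version that OpenCms depends on.)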

// For the import clauses, see below
public static void main(String[] args) throws Exception {
    final Version VER = Version.LUCENE_36;
    // With the StandardAnalyzer or the EnglishAnalyzer
    // the search works as expected
    Analyzer analyzer = new GermanAnalyzer(VER);

    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "labor");
    addDoc(w, "laser");
    addDoc(w, "latex");
    w.close();

    String querystr = "laser~"; // Fuzzy search for 'title'
    Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
    System.out.println("Querystr: " + querystr + "; Query: " + q);

    int hitsPerPage = 10;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
    }
}

private static void addDoc(IndexWriter w, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

Output of this code:

Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex

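I deliberately left out the imports so as not to clutter the code. To build the project you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (both can be downloaded from the Apache archive) and the following imports: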

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

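Some Lucene debugging details and remarks

While debugging the Lucene code, I found that Lucene's GermanAnalyzer stores the document titles in the index as:
- "laser" -> "las"
- "labor" -> "labor"
- "latex" -> "latex"
I also found that with an exact search for laser, the query string is analyzed as well. The output of the code above for the query laser is: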
```
Querystr: laser; Query: title:las
Found 1 hits.
1. laser
```
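(Note the different queries in the two runs: title:laser~0.5 in the first run versus title:las in the second.)
As mentioned before, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected: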
```
Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex
```
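Lucene computes the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shortest term. According to its documentation, similarity returns:

[...]
1 - (editDistance / length)
where length is the length of the shortest term (text or target) including the prefix that is identical, and editDistance is the Levenshtein distance between the two words.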

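Note that, with the minimum similarity of 0.5 shown in the parsed query, the stemmed term "las" falls below the threshold while "labor" does not: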
```
similarity("laser","las"  ) = 1 - (2 / 3) = 1/3
similarity("laser","labor") = 1 - (2 / 5) = 3/5
```
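
The arithmetic above can be reproduced with a small stand-alone sketch. This is not Lucene's FuzzyTermEnum code: it ignores the prefix handling, the class and method names are made up for illustration, and it assumes that a term matches only when its similarity is strictly greater than the minimum similarity.

```java
final class FuzzySimilaritySketch {

    /** Plain dynamic-programming Levenshtein distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1 - (editDistance / length), with length = length of the shorter term. */
    static double similarity(String text, String target) {
        int length = Math.min(text.length(), target.length());
        return 1.0 - (double) editDistance(text, target) / length;
    }

    public static void main(String[] args) {
        double minimumSimilarity = 0.5; // the value shown in title:laser~0.5
        // Indexed terms as reported for the GermanAnalyzer in this question.
        for (String indexed : new String[] {"las", "labor", "latex"}) {
            double s = similarity("laser", indexed);
            System.out.println(indexed + ": " + s
                    + (s > minimumSimilarity ? " (match)" : " (no match)"));
        }
    }
}
```

Under these assumptions, "las" stays below the threshold while "labor" and "latex" clear it, which is consistent with the two hits reported for the GermanAnalyzer run.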

Edit 1. Explicitly excluding "laser" from the analyzer's stemming also yields the expected search results:

Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet() {
    {
        add("laser");
    }
});

Output:

Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex

score 1 · Accepted Answer

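It turns out* that before the 3.6 branch, fuzzy queries did not pass through the analyzer (the component that performs stemming and lowercasing). In the 3.6 branch, some filters were added to the query analyzer chain (e.g. LowerCaseFilterFactory). Finally, GermanNormalizationFilterFactory was added to this chain in the 4.0 branch.

* Thanks to @femtoRgon for the pointer.

An older post explains, with an example, why fuzzy searches do not pass through the analyzer:

The reason for skipping the analyzer is that if you were searching for "dogs*", you would not want "dogs" to be stemmed to "dog" first, since that would then match "dog*", which is not the intended query.

The bottom line is that if you stay on Lucene 3.6.2, you have to implement the query analysis yourself.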
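
One way to picture "implementing the query analysis yourself" is to analyze the term part of a fuzzy query before it is parsed, leaving the tilde suffix alone. The following is only a rough sketch under several assumptions: the class and method names are hypothetical, it handles a single-term query such as laser~ or laser~0.5 and nothing more complex, and the analyzer is replaced by a stand-in function that hard-codes the one stem reported in the question. In a real application that function would have to feed the term through the same GermanAnalyzer that was used for indexing, which is not shown here.

```java
import java.util.function.UnaryOperator;

final class FuzzyQueryPreprocessorSketch {

    /**
     * Rewrites "term~" or "term~0.5" so that the term part is analyzed while
     * the fuzzy suffix is kept untouched. Queries without a tilde (or that
     * start with one) are returned unchanged.
     */
    static String analyzeFuzzyTerm(String query, UnaryOperator<String> analyzeTerm) {
        int tilde = query.lastIndexOf('~');
        if (tilde <= 0) {
            return query;
        }
        String term = query.substring(0, tilde);
        String fuzzySuffix = query.substring(tilde);
        return analyzeTerm.apply(term) + fuzzySuffix;
    }

    public static void main(String[] args) {
        // Stand-in for the GermanAnalyzer (an assumption for this sketch):
        // it only knows the stem "laser" -> "las" observed in the question.
        UnaryOperator<String> fakeGermanStemmer =
                term -> term.equals("laser") ? "las" : term;

        System.out.println(analyzeFuzzyTerm("laser~", fakeGermanStemmer));
    }
}
```

With the query rewritten to las~, the fuzzy comparison runs against the stemmed form that is actually stored in the index. The caveat from the quoted post still applies: analyzing the term changes what the user typed, so this kind of rewriting may be unsuitable for wildcard queries.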
