lucene - Underscores in place of stop words in Lucene output when getting ngram frequencies

Question

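I'm currently giving users the option of whether or not to include stop words when filtering a body of text for ngram frequencies. Typically, this is done as follows: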

snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);               
shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());

stopWords is set to the full list of words to either include in or remove from the ngrams. this.getnGramLength() simply returns the current ngram length, which is at most three.

If I include stop words when filtering the text "satellite is definitely falling to earth" for trigrams, the output is this:

No=1, Key=to, Freq=1
No=2, Key=definitely, Freq=1
No=3, Key=falling to earth, Freq=1
No=4, Key=satellite, Freq=1
No=5, Key=is, Freq=1
No=6, Key=definitely falling to, Freq=1
No=7, Key=definitely falling, Freq=1
No=8, Key=falling, Freq=1
No=9, Key=to earth, Freq=1
No=10, Key=satellite is, Freq=1
No=11, Key=is definitely, Freq=1
No=12, Key=falling to, Freq=1
No=13, Key=is definitely falling, Freq=1
No=14, Key=earth, Freq=1
No=15, Key=satellite is definitely, Freq=1

However, if I exclude stop words for trigrams, the output is this:

No=1, Key=satellite, Freq=1
No=2, Key=falling _, Freq=1
No=3, Key=satellite _ _, Freq=1
No=4, Key=_ earth, Freq=1
No=5, Key=falling, Freq=1
No=6, Key=satellite _, Freq=1
No=7, Key=_ _, Freq=1
No=8, Key=_ falling _, Freq=1
No=9, Key=falling _ earth, Freq=1
No=10, Key=_, Freq=3
No=11, Key=earth, Freq=1
No=12, Key=_ _ falling, Freq=1
No=13, Key=_ falling, Freq=1

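Why am I seeing the underscores? I would have expected to see the simple unigrams, plus "satellite falling", "falling earth" and "satellite falling earth". "definitely" is in the set of stop words I'm using.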

I could filter out the results that contain underscores, but...
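For reference, here is a minimal sketch of that post-filtering. It assumes the frequencies have already been collected into a Map<String, Integer>; the class and method names are hypothetical and nothing here is Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: post-filters an ngram frequency map.
class UnderscoreFilterSketch {

    // Returns a copy of counts without any key that has "_" as one of its
    // space-separated tokens.
    static Map<String, Integer> dropFillerKeys(Map<String, Integer> counts) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            boolean hasFiller = false;
            for (String token : entry.getKey().split(" ")) {
                if (token.equals("_")) {
                    hasFiller = true;
                    break;
                }
            }
            if (!hasFiller) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A few of the keys from the output above.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("satellite", 1);
        counts.put("falling _", 1);
        counts.put("falling _ earth", 1);
        counts.put("_", 3);
        counts.put("earth", 1);
        System.out.println(dropFillerKeys(counts));
    }
}
```

On the output above, every bigram and trigram contains at least one underscore, so this filter would leave only the unigrams. It does not produce "satellite falling" or "falling earth".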

score 3 · Accepted Answer

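The underscores stand for the missing stop word(s). To avoid this behaviour you should set enablePositionIncrements to false, but SnowballAnalyzer (now deprecated in 4.0.0-Beta) does not allow you to do that.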
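To make the effect concrete, here is a rough plain-Java sketch that does not call Lucene at all. It imitates the two behaviours on a token list: removing stop words while leaving a "_" placeholder at each removed position (position increments kept), versus removing them and closing the gap (position increments disabled). The class and method names are hypothetical, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java illustration; it only imitates the behaviour described above.
class ShingleGapSketch {

    // Removes stop words. If keepGaps is true, each removed word leaves a "_"
    // placeholder behind, imitating preserved position increments.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords, boolean keepGaps) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            } else if (keepGaps) {
                out.add("_");
            }
        }
        return out;
    }

    // Counts every ngram of length 1..maxLength over the token list.
    static Map<String, Integer> countNgrams(List<String> tokens, int maxLength) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = 1; len <= maxLength && start + len <= tokens.size(); len++) {
                String key = String.join(" ", tokens.subList(start, start + len));
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("satellite", "is", "definitely", "falling", "to", "earth");
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "definitely", "to"));

        // Gaps kept: keys such as "satellite _ _" and "falling _ earth" appear.
        System.out.println(countNgrams(removeStopWords(tokens, stopWords, true), 3));
        // Gaps closed: only ngrams built from the remaining words appear.
        System.out.println(countNgrams(removeStopWords(tokens, stopWords, false), 3));
    }
}
```

With the gaps kept, the sketch yields the same 13 keys as the question's second listing, including "_" with a count of 3. With the gaps closed it yields six keys, among them "satellite falling", "falling earth" and "satellite falling earth".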

One solution is to start with a StandardAnalyzer that has no stop words, and then decorate its output with StopFilter, SnowballFilter and ShingleFilter. Example for bigrams in Lucene 4.0.0-Beta:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET);
TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(input));
StopFilter stopFilter = new StopFilter(Version.LUCENE_40, tokenStream, stopWords);
stopFilter.setEnablePositionIncrements(false);
SnowballFilter snowballFilter = new SnowballFilter(stopFilter, "English");
ShingleFilter bigramShingleFilter = new ShingleFilter(snowballFilter, 2, 2);

Hope this gets you on the right track!

Edit

This is no longer possible as of Lucene v4.4+. Still looking for a decent alternative...
