java - Java Lucene stop word filter

Question

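I have about 500 sentences from which I want to compile a set of n-grams. I can't get the stop words removed. I tried adding the Lucene StandardFilter and StopFilter, but I still have the same problem. Here is my code: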

for (String curS : Sentences)
{
    reader = new StringReader(curS);
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    while (tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram); // store each token in an ArrayList
    }
}

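For example, the first phrase I am testing is: "For every person who listens". In this example, curNGram is set to "For", which is a stop word in my stopWords list. Also, "every" is a stop word in this example, so "person" should be the first n-gram.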

Why are stop words being added to my list when I am using StopFilter?

Any help is appreciated!

score 1 · Accepted Answer

What you've posted looks fine to me, so I suspect that stopWords isn't giving the filter the information you intend.

Try something like:

// Say we read the stop words into an ArrayList (a plain array or any List implementation should also be fine)
List<String> words = new ArrayList<String>();
// Read the file into words.
Set<?> stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true); // true = ignoreCase

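Assuming the stop word list you generated (which I named "words") looks the way you think it does, this should put the words into a format that StopFilter can use.

Are you already generating stopWords like this?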
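
One possible cause, offered as a guess rather than a diagnosis: the last argument to makeStopSet is ignoreCase, and the chain in the question has no lowercasing step, so a capitalized token such as "For" would not match a lowercase entry "for" in a case-sensitive set. The sketch below imitates that comparison in plain Java, without Lucene. The class StopWordCaseDemo, the method removeStopWords, and the lowercase stop list are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java stand-in for the stop-word check; this is NOT Lucene code.
final class StopWordCaseDemo {

    // Keeps every token that is not in stopWords. When ignoreCase is true,
    // both the stop words and the tokens are lowercased before comparing.
    static List<String> removeStopWords(List<String> tokens,
                                        List<String> stopWords,
                                        boolean ignoreCase) {
        Set<String> stopSet = new HashSet<String>();
        for (String word : stopWords) {
            stopSet.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!stopSet.contains(key)) {
                kept.add(token); // keep the token in its original case
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed inputs: tokens keep their case, stop list is lowercase.
        List<String> tokens = Arrays.asList("For", "every", "person");
        List<String> stopWords = Arrays.asList("for", "every");

        // Case-sensitive comparison: "For" slips through.
        System.out.println(removeStopWords(tokens, stopWords, false)); // [For, person]
        // Case-insensitive comparison: both stop words are dropped.
        System.out.println(removeStopWords(tokens, stopWords, true));  // [person]
    }
}
```

If your stop words are lowercase and were put into an ordinary Set, building the set with ignoreCase set to true (as above) or adding a LowerCaseFilter ahead of the StopFilter might be worth trying. Keep in mind that a LowerCaseFilter would also lowercase the resulting n-grams.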
