java - Removing stop words from parsed content using OpenNLP

Question

I parsed a document using the OpenNLP parser code provided at this link and got the following output:

(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) (ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))

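From this I want to extract only the meaningful words, which means removing all stop words, because I want to do further classification based on those meaningful words. Can you suggest how to remove the stop words from the parsed output?

In the end I want to get the output below: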

   (TOP (S (NP (NN Programcreek)) (JJ useful)) (NN website)))))

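Please help me with this. If it is not possible with OpenNLP, please recommend any other Java library for natural language processing, since my main goal is to parse the document and keep only the meaningful words.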

score 5 · Accepted Answer

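OpenNLP does not appear to support this. You will have to either do as Olena Vikariy suggests and implement it yourself, or use a different Java NLP library such as Mallet.

A Java implementation of stop-word removal looks like this (no sorting needed):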

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String testText = "This is a text you want to test";
String[] stopWords = new String[]{"a", "able", "about", "above", "according", "accordingly", "across", "actually", "after", "afterwards", "again", "against", "all"};
// Join the stop words into one alternation: a|able|about|...
String stopWordsPattern = String.join("|", stopWords);
// \b restricts matches to whole words; \s* also removes the whitespace after each match.
Pattern pattern = Pattern.compile("\\b(?:" + stopWordsPattern + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(testText);
testText = matcher.replaceAll("");

You can use this list of English stop words.
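Since the question starts from the bracketed parse string rather than raw text, the same idea can be applied to the leaves of the parse. The following is only a minimal sketch under stated assumptions: it works on the printed parse string with a regular expression instead of the OpenNLP API, the class name `ParseLeafFilter` and the tiny stop-word set are made up for illustration, and `Set.of` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class ParseLeafFilter {
    // Illustrative stop-word set; a real list would be much longer.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "is", "the", "very");

    // Matches a leaf such as "(NN website)": a tag, one space, a word, no nested brackets.
    private static final Pattern LEAF = Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    // Returns the leaf words of the parse, in order, that are not stop words.
    static List<String> meaningfulWords(String parse) {
        List<String> kept = new ArrayList<>();
        Matcher m = LEAF.matcher(parse);
        while (m.find()) {
            String word = m.group(2);
            if (!STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String parse = "(TOP (S (NP (NN Programcreek)) (VP (VBZ is) (NP (DT a) "
                + "(ADJP (RB very) (JJ huge) (CC and) (JJ useful)) (NN website)))))";
        // Prints [Programcreek, huge, useful, website]
        System.out.println(meaningfulWords(parse));
    }
}
```

Which words survive depends entirely on the stop-word list; `m.group(1)` holds the part-of-speech tag if you also want to filter by tag.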

Alternatively, with Mallet, follow the tutorial linked here. The stop-word removal step is defined using a Pipe:

pipeList.add(new TokenSequenceRemoveStopwords(false, false));

Mallet ships with a stop-word list, so you do not need to define one, but the list can be extended if needed.

Hope this helps.

score 3 · Accepted Answer

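You can easily remove all stop words from the text before passing it to OpenNLP.

Store the stop words in an array
Sort the array by word length, longest first, so that "did" is not removed before "didn't", which would leave a stray "n't" behind
Use a regular expression to remove all the words, making sure to ignore case and to remove whole words only

Here is how to do it in .NET; you can adapt it to Java.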

public string CleanStopWords(string inputText)
{
    string[] stopWords = new string[] { 
        "a", "all", "am", "an", "and", "any", "are", "aren't", 
        "as", "at", "be", "because", "been", "to", "from", "by", 
        "can", "can't", "do", "don't", "didn't", "did" };

    stopWords = stopWords.OrderByDescending(w => w.Length).ToArray();

    string outputText = Regex.Replace(inputText, "\\b" + string.Join("\\b|\\b", stopWords) + "\\b", "", RegexOptions.IgnoreCase);

    return outputText;
}
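
For readers who want the Java equivalent, here is a rough port of the method above. It is a sketch rather than a tested drop-in: the class name `StopWordCleaner` is invented, and it deviates from the .NET version in two small ways, both assumptions on my part: each word is wrapped in `Pattern.quote` in case a stop word contains a regex metacharacter, and `\s*` is appended so that removed words do not leave double spaces.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name; the method mirrors CleanStopWords from the .NET code.
class StopWordCleaner {
    static String cleanStopWords(String inputText) {
        String[] stopWords = {
            "a", "all", "am", "an", "and", "any", "are", "aren't",
            "as", "at", "be", "because", "been", "to", "from", "by",
            "can", "can't", "do", "don't", "didn't", "did" };

        // Longest first, so "didn't" is tried before "did" within the alternation.
        String alternation = Arrays.stream(stopWords)
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));

        // Whole words only, ignoring case; \s* also drops the whitespace after a match.
        Pattern pattern = Pattern.compile(
                "\\b(?:" + alternation + ")\\b\\s*", Pattern.CASE_INSENSITIVE);
        return pattern.matcher(inputText).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // Prints: I it
        System.out.println(cleanStopWords("I didn't do it"));
    }
}
```

In a real program you would compile the pattern once and reuse it instead of rebuilding it on every call.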
