java - Extracting all nouns, adjectives and text via the Stanford parser

Question

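I'm trying to extract all nouns and adjectives from a given text using the Stanford parser.

My current attempt is to use pattern matching on the getChildrenAsList() of the Tree object to locate things like:

(NN paper), (NN algorithm), (NN information), ...

and save them in an array.

Input sentence:

In this paper we present an algorithm that extracts semantic information from an arbitrary text.

Result - String: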

[(S (PP (IN In) (NP (DT this) (NN paper))) (NP (PRP we)) (VP (VBP present) (NP (NP (DT an) (NN algorithm)) (SBAR (WHNP (WDT that)) (S (VP (VBD extracts) (NP (JJ semantic) (NN information)) (PP (IN from) (NP (DT an) (ADJP (JJ arbitrary)) (NN text)))))))) (. .))]

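I tried pattern matching because I couldn't find a method in the Stanford parser that returns all words of a given class, for example nouns.

Is there a better way to extract these word classes, or does the parser provide a specific method for this?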

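import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

// The class name is illustrative; the original snippet omitted the class declaration.
public class ParserDemo {
    public static void main(String[] args) {
        String str = "In this paper we present an algorithm that extracts semantic information from an arbitrary text.";
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        Tree parseS = (Tree) lp.apply(str);
        System.out.println("tr.getChildrenAsList().toString()" + parseS.getChildrenAsList().toString());
    }
}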
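
For illustration, the pattern-matching approach described in the question could be sketched roughly as follows. This is only a sketch that works on the bracketed string form of a parse; the class name `BracketedTagExtractor`, the method `extract`, and the regular expression are assumptions made for this example and are not part of the Stanford parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketedTagExtractor {
    // Matches leaf nodes such as "(NN paper)" or "(JJ semantic)":
    // a tag starting with N or J, one space, then a single token with no parentheses.
    // Phrase nodes such as "(NP (DT this) ...)" do not match, because the tag is
    // followed by another opening parenthesis rather than a token.
    private static final Pattern LEAF =
            Pattern.compile("\\(([NJ][A-Z$]*) ([^()\\s]+)\\)");

    // Returns the matched leaves as "word/TAG" strings, in order of appearance.
    static List<String> extract(String bracketedParse) {
        List<String> hits = new ArrayList<>();
        Matcher m = LEAF.matcher(bracketedParse);
        while (m.find()) {
            hits.add(m.group(2) + "/" + m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        // A fragment of the parse shown in the question.
        String parse = "(NP (JJ semantic) (NN information))";
        System.out.println(extract(parse));
    }
}
```

Run on the fragment above, this prints `[semantic/JJ, information/NN]`. Matching on the string form is fragile compared with asking the tree for its tagged words, which is what the accepted answer suggests.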

score 6 · Accepted Answer

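BTW, if all you want is parts of speech like nouns and verbs, you should just use a part-of-speech tagger, such as the Stanford POS tagger. It will run a couple of orders of magnitude faster and be at least as accurate.

But you can do it with the parser. The method you want is taggedYield(), which returns a List<TaggedWord>. So you have: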

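// Cast the result of apply() to Tree first, then ask the tree for its tagged words.
List<TaggedWord> taggedWords = ((Tree) lp.apply(str)).taggedYield();
for (TaggedWord tw : taggedWords) {
  // Penn Treebank noun tags start with "N", adjective tags with "J".
  if (tw.tag().startsWith("N") || tw.tag().startsWith("J")) {
    System.out.printf("%s/%s%n", tw.word(), tw.tag());
  }
}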

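(This method cuts a corner, relying on the fact that in the Penn Treebank tag set all and only the adjective and noun tags start with J or N. More generally, you could check for membership in a set of tags.)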
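
A minimal sketch of the more general set-membership check might look like the following. To keep it self-contained it uses plain `{word, tag}` string pairs as a hypothetical stand-in for `TaggedWord`; the class name `TagSetFilter`, the method `keepNounsAndAdjectives`, and the constant `NOUN_ADJ_TAGS` are assumptions made for this example. The tag list is the standard Penn Treebank set of noun and adjective tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TagSetFilter {
    // Penn Treebank noun tags (NN, NNS, NNP, NNPS) and adjective tags (JJ, JJR, JJS).
    static final Set<String> NOUN_ADJ_TAGS = new HashSet<>(
            Arrays.asList("NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS"));

    // Each element is a {word, tag} pair (a hypothetical stand-in for TaggedWord).
    // Keeps only the pairs whose tag is in NOUN_ADJ_TAGS, formatted as "word/TAG".
    static List<String> keepNounsAndAdjectives(List<String[]> tagged) {
        List<String> kept = new ArrayList<>();
        for (String[] pair : tagged) {
            if (NOUN_ADJ_TAGS.contains(pair[1])) {
                kept.add(pair[0] + "/" + pair[1]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> tagged = Arrays.asList(
                new String[] {"semantic", "JJ"},
                new String[] {"information", "NN"},
                new String[] {"from", "IN"});
        System.out.println(keepNounsAndAdjectives(tagged));
    }
}
```

With the parser, the same test would presumably be written as `NOUN_ADJ_TAGS.contains(tw.tag())` inside the loop over `taggedYield()`. An explicit set makes it easy to narrow the filter, for example to proper nouns only.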

P.S. The stanford-nlp tag is the best one to use for questions about Stanford NLP tools on Stack Overflow.

score 1 · Accepted Answer

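I'm sure you are aware of nltk (the Natural Language Toolkit). Just install this Python library along with the maxent POS tagger, and the following code should do the trick. The tagger was trained on the Penn Treebank, so the tags are no different. The code above is not nltk, but I like nltk, hence this answer.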

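    import nltk

    # Requires the NLTK tokenizer and POS tagger data to be installed.
    # Read the text into the variable "text"; the question's sentence is used here.
    text = "In this paper we present an algorithm that extracts semantic information from an arbitrary text."

    nouns = []
    adj = []
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs
    for word, tag in tagged:
        if tag.startswith("N"):    # Penn noun tags: NN, NNS, NNP, NNPS
            nouns.append(word)
        elif tag.startswith("J"):  # Penn adjective tags: JJ, JJR, JJS
            adj.append(word)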
