list - Turning a subtree into a list with NLTK in Python (RSS feed chunking)

Question

Using the code below, I am chunking an RSS feed that has already been tokenized and POS-tagged. "print subtree.leaves()" outputs:

[('Prime', 'NNP'), ('Minister', 'NNP'), ('Stephen', 'NNP'), ('Harper', 'NNP')]
[('US', 'NNP'), ('President', 'NNP'), ('Barack', 'NNP'), ('Obama', 'NNP')]
[('what\', 'NNP')]
[('Keystone', 'NNP'), ('XL', 'NNP')]
[('CBC', 'NNP'), ('News', 'NNP')]

This looks like a Python list, but I can't figure out how to access it directly or iterate over it. I think it is the output of a subtree.

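I would like to turn this subtree into a list that I can manipulate. Is there an easy way to do that? This is the first time I have run into trees in Python, and I am lost. I want to end up with this list: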

docs = ["Prime Minister Stephen Harper", "US President Barack Obama", "what\", "Keystone XL", "CBC News"]

Is there a simple way to make this happen?

Thanks, as always, for your help!

grammar = r""" Proper: {<NNP>+} """

cp = nltk.RegexpParser(grammar)
result = cp.parse(posDocuments)
nounPhraseDocs.append(result) 

for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    # print the noun phrase as a list of part-of-speech tagged words
    print subtree.leaves()
print " "

score 6 · Accepted Answer

node has now been replaced by label(). So, modifying Viktor's answer:

docs = []

for subtree in result.subtrees(filter=lambda t: t.label() == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))

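This will give you a list of only those tokens that are part of a Proper chunk. You can remove the filter argument from the subtrees() method, and you will then get one joined string for every subtree, including the root of the tree, rather than only for the Proper chunks.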
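For anyone who wants to try this end to end, here is a minimal sketch, assuming NLTK 3 or later is installed. The tagged sentence is made up for illustration (it stands in for one tagged RSS item and is not the asker's actual posDocuments), and all_docs is a name introduced here only to show what happens without the filter.

```python
import nltk

# Hypothetical pre-tagged sentence standing in for one tagged RSS item.
tagged = [
    ("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP"),
    ("met", "VBD"),
    ("US", "NNP"), ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP"),
    ("today", "NN"),
]

# Same grammar as in the question: one or more consecutive NNP tags form a chunk.
grammar = r"""Proper: {<NNP>+}"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)

# One joined string per Proper chunk.
docs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees(filter=lambda t: t.label() == "Proper")
]

# Without the filter, the root of the tree is visited first, then each chunk.
all_docs = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in result.subtrees()
]

print(docs)
print(all_docs[0])
```

Here docs holds one string per Proper chunk, while all_docs also starts with the whole sentence, because subtrees() visits the root before its children.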

score 1 · Accepted Answer

docs = []

for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))

print docs

This should do the trick.
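The join step can also be looked at on its own, since leaves() returns an ordinary Python list of (word, tag) tuples. The sketch below does not use NLTK at all: leaves_per_subtree is a hypothetical stand-in for the lists printed in the question (the 'what\' entry is left out), and it is written for Python 3.

```python
# Stand-in for what subtree.leaves() returned for each Proper chunk in the question.
leaves_per_subtree = [
    [("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP")],
    [("US", "NNP"), ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP")],
    [("Keystone", "NNP"), ("XL", "NNP")],
    [("CBC", "NNP"), ("News", "NNP")],
]

docs = []
for leaves in leaves_per_subtree:
    # Keep only the word from each (word, tag) pair, then join with spaces.
    words = [word for (word, tag) in leaves]
    docs.append(" ".join(words))

print(docs)
```

Each entry of docs is a plain string, so the result can be indexed, sliced, and iterated like any other list.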
