python - Counting verbs, nouns, and other parts of speech with Python's NLTK

Question

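I have multiple texts, and I would like to create a profile of each one based on its use of the various parts of speech, such as nouns and verbs. Basically, I need to count how many times each part of speech is used.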

I have tagged the text, but I am not sure how to go further:

tokens = nltk.word_tokenize(text.lower())
text = nltk.Text(tokens)
tags = nltk.pos_tag(text)

How can I save the count for each part of speech into a variable?

score 34 · Accepted Answer

The pos_tag method gives you back a list of (token, tag) pairs:

tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')]

If you are using Python 2.7 or later, you can do this simply with:

>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
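
If you want one variable per broad category rather than one count per fine-grained tag, the per-tag idea above can be extended by grouping tags on their prefix. The following is only a sketch: it assumes the tagger emits Penn Treebank tags (where noun tags start with NN, verb tags with VB, adjective tags with JJ, and adverb tags with RB), and the COARSE_PREFIXES mapping and coarse_category helper are hypothetical names introduced for illustration, not part of NLTK.

```python
from collections import Counter

# Hand-written sample in the same (token, tag) shape that pos_tag returns.
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VBZ'), ('two', 'CD'), ('cats', 'NNS')]

# Hypothetical mapping (an assumption, not an NLTK API): Penn Treebank
# tag prefix -> coarse category.
COARSE_PREFIXES = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}

def coarse_category(tag):
    """Map a fine-grained tag to a coarse category, or 'other' if unmapped."""
    return COARSE_PREFIXES.get(tag[:2], 'other')

coarse_counts = Counter(coarse_category(tag) for _, tag in tagged)

# Counter returns 0 for categories that never occurred, so these lookups are safe.
noun_count = coarse_counts['noun']
verb_count = coarse_counts['verb']
adjective_count = coarse_counts['adjective']
```

Keeping the counts in one Counter and reading individual entries from it tends to scale better than a separate variable per tag, since the set of tags is not known in advance.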

To normalize the counts (giving you the proportion of each), do:

>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag,count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
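
Since the question is about profiling several texts, the counting and normalizing steps can be wrapped in one function and applied to each text in turn. This is a minimal sketch assuming Python 3 (where / is true division, so float() is not needed); pos_profile and tagged_texts are hypothetical names, and the tagged lists are hand-written stand-ins for real pos_tag output.

```python
from collections import Counter

def pos_profile(tagged):
    """Return a {tag: proportion} dict for a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    if total == 0:
        # An empty text has no meaningful profile; avoid dividing by zero.
        return {}
    return {tag: count / total for tag, count in counts.items()}

# Hypothetical pre-tagged texts; in practice each list would come from pos_tag.
tagged_texts = {
    'text_a': [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')],
    'text_b': [('dogs', 'NNS'), ('bark', 'VBP')],
}

# One normalized profile per text, keyed by the text's name.
profiles = {name: pos_profile(pairs) for name, pairs in tagged_texts.items()}
```

Because the profiles are proportions rather than raw counts, texts of different lengths can be compared directly.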

Note that in older versions of Python, you will have to implement Counter yourself:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...     counts[tag] += 1
...
>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})
