2

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?

For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}

I have not been able to find a POS-tokenizer that tags part-of-speech for words outside of the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.


2 Answers

5

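Yes. The simplest way is not to use a tagger at all: load one or more tagged corpora and collect the set of all tags for the word you are interested in. If you are interested in more than one word, it is simplest to collect the tags for every word in the corpus and then look up whatever you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset: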

>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...         for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
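
The same idea can be sketched without any NLTK-specific classes, which may help if you only want plain sets as in the question. The helper names `build_tag_index` and `possible_tags` are made up for this illustration, and the tiny `sample` list is a hand-written stand-in for a real corpus; in practice you would presumably pass something like `nltk.corpus.brown.tagged_words(tagset="universal")` instead.

```python
from collections import Counter, defaultdict


def build_tag_index(tagged_words):
    """Map each lowercased word to a Counter of the tags it was seen with."""
    index = defaultdict(Counter)
    for word, tag in tagged_words:
        index[word.lower()][tag] += 1
    return index


def possible_tags(index, word):
    """Return the set of tags observed for `word` (empty set if unseen)."""
    # .get() avoids creating an empty entry for unknown words.
    return set(index.get(word.lower(), ()))


# Hand-made sample standing in for a real tagged corpus (assumption).
sample = [
    ("The", "DET"), ("report", "NOUN"),
    ("They", "PRON"), ("report", "VERB"),
    ("kind", "ADJ"), ("kind", "NOUN"),
]
index = build_tag_index(sample)
print(sorted(possible_tags(index, "report")))  # ['NOUN', 'VERB']
```

A word that never occurs in the corpus simply yields an empty set, so coverage depends entirely on the corpus you load.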
answered 2017-06-06T07:35:44.053
4

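Because POS models are trained on sentence- or document-level data, a pre-trained model expects a sentence or document as input. When you give it a single word, it treats that word as a one-word sentence, so you get exactly one tag for it in that one-word context.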

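If you are trying to find all possible POS tags for each English word, you need a corpus that contains the words in many different uses; then tag the corpus and count or extract the tags seen for each word. For example: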

>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]

>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]

>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
... 

>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})

>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
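
Since an automatic tagger can make mistakes, on a larger corpus you might want to ignore tags that show up only rarely for a word. Below is a rough sketch of one way to do that; the function name `frequent_tags`, the `min_share` threshold of 0.1, and the example counts are all invented for illustration and are not taken from any real corpus.

```python
from collections import Counter


def frequent_tags(tag_counts, min_share=0.1):
    """Keep only tags that account for at least `min_share` of a word's occurrences."""
    total = sum(tag_counts.values())
    if total == 0:
        return set()
    return {tag for tag, n in tag_counts.items() if n / total >= min_share}


# Invented counts for a single word (assumption, not real data).
example = Counter({"NNS": 40, "VBZ": 9, "JJ": 1})
print(sorted(frequent_tags(example)))  # ['NNS', 'VBZ']
```

The right threshold depends on the corpus size and the tagger's accuracy, so treat 0.1 as a placeholder.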

Alternatively, there is WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})

Note, however, that WordNet is a hand-crafted resource, so you cannot expect every English word to be in it.
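
To get from WordNet's one-letter codes to the kind of set shown in the question, a small lookup table is probably enough. WordNet uses `n`, `v`, `a`, `s` (satellite adjective) and `r` (adverb); the names `WORDNET_POS_NAMES` and `readable_pos` are made up for this sketch, and folding `s` into "Adjective" is a simplification I am assuming you want.

```python
# WordNet POS codes mapped to readable names.
WORDNET_POS_NAMES = {
    "n": "Noun",
    "v": "Verb",
    "a": "Adjective",
    "s": "Adjective",  # satellite adjective, folded into Adjective here
    "r": "Adverb",
}


def readable_pos(codes):
    """Turn an iterable of WordNet POS codes into a set of readable names."""
    # Unknown codes are passed through unchanged rather than dropped.
    return {WORDNET_POS_NAMES.get(code, code) for code in codes}


# Codes as returned above for 'coaches'; with NLTK installed you could pass
# (ss.pos() for ss in wn.synsets('coaches')) instead.
codes = ["n", "n", "n", "n", "n", "v", "v"]
print(sorted(readable_pos(codes)))  # ['Noun', 'Verb']
```

An empty result means WordNet has no entry for the word, not that the word has no part of speech.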

answered 2017-06-06T07:22:33.897