因为 POS 模型是在基于句子/文档的数据上训练的,所以预训练模型的预期输入是句子/文档。当只有一个单词时,它会将其视为一个单词句子,因此在该单词句子上下文中应该只有一个标签。
如果您试图找到每个英语单词的所有可能的 POS 标签,您将需要一个包含许多不同使用单词的语料库,然后标记语料库并计算/提取编号。每个单词的标签数。例如
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from collections import defaultdict, Counter
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
... counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
或者,还有 WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
但请注意,WordNet 是一种手工制作的资源,因此您不能期望每个英语单词都包含在其中。