I am training an NLTK UnigramTagger on sentences from the Brown Corpus. When I try different categories, such as fiction, romance, or humor, the accuracy I get is roughly the same, around 0.93, for each of them.
import nltk
from nltk.corpus import brown

# Fiction: train on the tagged sentences of the 'fiction' category and evaluate on the same sentences
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209

# Romance: same procedure for the 'romance' category
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324
Why is this? Is it because the categories come from the same corpus? Or because their part-of-speech tags are the same?
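One way I thought of probing this (a minimal diagnostic sketch of my own, assuming NLTK and the Brown Corpus data are installed, not part of the experiment above) is to train the tagger on one category and evaluate it on a different one, to see how much the categories' tag distributions overlap:

import nltk
from nltk.corpus import brown

# Hypothetical cross-category check: train on 'fiction' only,
# then evaluate both on the training category and on 'romance'.
fiction_sents = brown.tagged_sents(categories='fiction')
romance_sents = brown.tagged_sents(categories='romance')

tagger = nltk.UnigramTagger(fiction_sents)
print(tagger.evaluate(fiction_sents))   # accuracy on the training category
print(tagger.evaluate(romance_sents))   # accuracy on an unseen category

Would a comparison like this be the right way to tell whether the similar scores come from evaluating on the training data itself, or from the categories genuinely sharing the same tags?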