
I am training an NLTK UnigramTagger on sentences from the Brown Corpus.

I have tried different categories, and I get roughly the same value every time, around 0.9328 for each category such as fiction, romance, and humor.

import nltk
from nltk.corpus import brown


# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324

Why does this happen? Is it because the categories come from the same corpus? Or because their part-of-speech tags are the same?


1 Answer


It looks like you are training a UnigramTagger and then evaluating the trained tagger on the same data it was trained on. Have a look at the documentation for nltk.tag, in particular the section on evaluation.

With your code you get a deceptively high score because your training data and your evaluation/test data are identical. If you make the test data different from the training data, you will get different results. My examples are below:

Category: fiction

Here I used brown.tagged_sents(categories='fiction')[:500] as the training set and brown.tagged_sents(categories='fiction')[501:600] as the test/evaluation set:

from nltk.corpus import brown
import nltk

# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])

This gives you a score of ~0.7474610697359513.

Category: romance

Here I used brown.tagged_sents(categories='romance')[:500] as the training set and brown.tagged_sents(categories='romance')[501:600] as the test/evaluation set:

from nltk.corpus import brown
import nltk

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])

This gives you a score of ~0.7046799354491662.
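For completeness, here is a minimal sketch along the same lines that wraps the idea in a small helper; the function name unigram_accuracy and the 90/10 split are just illustrative choices, not anything prescribed by NLTK. It trains on the first 90% of a category's tagged sentences and evaluates on the held-out remainder:

import nltk
from nltk.corpus import brown

def unigram_accuracy(category, train_fraction=0.9):
    # Train on the first train_fraction of the category's sentences
    # and evaluate on the held-out remainder.
    tagged_sents = brown.tagged_sents(categories=category)
    split = int(len(tagged_sents) * train_fraction)
    train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]
    tagger = nltk.UnigramTagger(train_sents)
    # Newer NLTK releases deprecate evaluate() in favour of accuracy().
    return tagger.evaluate(test_sents)

for category in ['fiction', 'romance', 'humor']:
    print(category, unigram_accuracy(category))

The exact numbers will vary by category, but they should all come out well below the ~0.93 you see when testing on the training data itself.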

I hope this helps and answers your question.

answered 2020-03-03T15:45:19.827