
I am training an NLTK UnigramTagger on sentences from the Brown Corpus.

I have tried different categories, and I get roughly the same value every time, around 0.9328 for each category such as fiction, romance, and humor.

import nltk
from nltk.corpus import brown


# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
>>> 0.9348490474422324

Why does this happen? Is it because the categories come from the same corpus? Or because their part-of-speech tags are the same?


1 Answer


It looks like you are training a UnigramTagger and then evaluating the trained tagger on the same data it was trained on. Have a look at the documentation for nltk.tag, in particular the section on evaluation.

With your code you get a deceptively high score because your training data and your evaluation/test data are identical. If you make the test data different from the training data, you will get different results. My examples are below:

Category: fiction

Here I used brown.tagged_sents(categories='fiction')[:500] as the training set and brown.tagged_sents(categories='fiction')[501:600] as the test/evaluation set:

from nltk.corpus import brown
import nltk

# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])

This gives you a score of ~0.7474610697359513.

Category: romance

Here I used brown.tagged_sents(categories='romance')[:500] as the training set and brown.tagged_sents(categories='romance')[501:600] as the test/evaluation set:

from nltk.corpus import brown
import nltk

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])

This gives you a score of ~0.7046799354491662.
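For completeness, here is a minimal sketch along the same lines that wraps the idea in a small helper; the function name unigram_accuracy and the 90/10 split are just illustrative choices, not anything prescribed by NLTK. It trains on the first 90% of a category's tagged sentences and evaluates on the held-out remainder:

import nltk
from nltk.corpus import brown

def unigram_accuracy(category, train_fraction=0.9):
    # Train on the first train_fraction of the category's sentences
    # and evaluate on the held-out remainder.
    tagged_sents = brown.tagged_sents(categories=category)
    split = int(len(tagged_sents) * train_fraction)
    train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]
    tagger = nltk.UnigramTagger(train_sents)
    # Newer NLTK releases deprecate evaluate() in favour of accuracy().
    return tagger.evaluate(test_sents)

for category in ['fiction', 'romance', 'humor']:
    print(category, unigram_accuracy(category))

The exact numbers will vary by category, but they should all come out well below the ~0.93 you see when testing on the training data itself.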

I hope this helps and answers your question.

answered 2020-03-03T15:45:19.827