I am building a text classification model with nltk and sklearn, and training it on sklearn's 20newsgroups dataset (each document is around 130 words). My preprocessing removes stop words and lemmatizes the tokens. Next in my pipeline I pass the result to TfidfVectorizer() and want to tune some of the vectorizer's input parameters to improve accuracy. I have read that n-grams (usually with small n) tend to improve tf-idf accuracy, but when I classify the vectorizer's output with MultinomialNB(), using ngram_range=(1,2) or ngram_range=(1,3) actually lowers the accuracy. Can someone help explain why?
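One thing I checked while investigating: widening ngram_range makes the feature space grow quickly, which may matter with short (~130-word) documents. A minimal sketch on a made-up toy corpus (the sentences below are just for illustration, not my real data) showing how the vocabulary size balloons:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in corpus; real 20newsgroups documents are much longer.
docs = [
    "the pens beat the devils in the playoffs",
    "jagr is better than his regular season stats",
    "the islanders lost the final regular season game",
]

sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vec = TfidfVectorizer(ngram_range=ngram_range)
    vec.fit(docs)
    # vocabulary_ maps each term (unigram, bigram, ...) to a column index
    sizes[ngram_range] = len(vec.vocabulary_)
    print(ngram_range, sizes[ngram_range])
```

Even on three short sentences, almost every bigram and trigram occurs exactly once, so each wider range roughly doubles the number of features while adding mostly one-off terms.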
Edit: here is the requested sample data, along with the code I used to fetch it and strip the headers:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', remove=('headers',))
#example of data text (no header)
print(news.data[0])
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!
And here is my pipeline, with the code I run to train the model and print the accuracy:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

test1_pipeline = Pipeline([('clean', clean()),  # my custom preprocessing transformer
                           ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                           ('classifier', MultinomialNB())])

train(test1_pipeline, news_group_train.data, news_group_train.target)
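For comparing the ngram_range settings systematically, I also tried cross-validating the pipeline with GridSearchCV. A runnable sketch of that idea, using a tiny made-up two-class corpus (and omitting my custom clean() step) so it works without downloading the dataset; in practice you would pass news_group_train.data and news_group_train.target instead:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Tiny illustrative corpus: class 0 is hockey-ish, class 1 is hardware-ish.
texts = [
    "the pens beat the devils", "jagr scored in the playoffs",
    "hockey season stats", "the islanders lost the game",
    "the gpu renders frames", "new cpu benchmark results",
    "driver update improves graphics", "ram prices drop again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

# Cross-validate unigrams vs. unigrams+bigrams vs. unigrams+trigrams.
grid = GridSearchCV(
    pipe,
    param_grid={"vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]},
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```

The double-underscore syntax ("vectorizer__ngram_range") routes each parameter to the named pipeline step, so the same search could also sweep things like min_df or the classifier's alpha.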