I know very little about this area, but I can share what I do know; please correct me if I'm wrong. From what I can see in the link, there is no mention of using tf-idf scores for classification. You should look at the link to see how a Naive Bayes classifier is used. In general, the code looks like this (I took this snippet from that link):
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# integer division so the cutoffs are valid slice indices (75/25 train/test split)
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)
Each training instance is a tuple of a feature dictionary and a class label; for example, it could be ({"sucks": True, "bad": True, "boring": True}, "Negative").
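A minimal sketch of that format, with no corpus needed (the review words here are made up):

```python
def word_feats(words):
    # Presence features: each word simply maps to True
    return dict([(word, True) for word in words])

# A hypothetical negative review reduced to its words
instance = (word_feats(["sucks", "bad", "boring"]), "Negative")
print(instance)
# ({'sucks': True, 'bad': True, 'boring': True}, 'Negative')
```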
As for numeric attributes, I think a common approach is to discretize them into categories such as low/medium/high.
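For instance, a sketch of that binning idea (the thresholds and the `rating` attribute are made up for illustration):

```python
def bin_numeric(value, low=3.0, high=7.0):
    # Discretize a numeric attribute into 'low'/'medium'/'high';
    # the thresholds are hypothetical and would be tuned per attribute
    if value < low:
        return 'low'
    elif value < high:
        return 'medium'
    return 'high'

# Fold the binned value into the feature dictionary alongside word features
feats = {'sucks': True, 'bad': True}
feats['rating=' + bin_numeric(2.0)] = True
print(feats)
# {'sucks': True, 'bad': True, 'rating=low': True}
```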
Regarding tf-idf scores, I'm not very sure. I think one way they could be used is for feature selection: if the number of features is too large, you can take only the top n words (by tf-idf score) as features.
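A rough sketch of that idea, computing tf-idf by hand on a toy corpus (the documents and the choice of n are made up; a real setup would score words over the movie_reviews corpus instead):

```python
import math

# Toy "documents", each a list of words
docs = [["bad", "boring", "plot"],
        ["good", "fun", "plot"],
        ["bad", "acting", "fun"]]

def tf_idf(term, doc, docs):
    # term frequency in this document times inverse document frequency
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# Score every word by its best tf-idf across documents, keep the top n;
# ties are broken alphabetically so the result is deterministic
n = 4
scores = {w: max(tf_idf(w, d, docs) for d in docs) for w in set().union(*docs)}
top_words = sorted(scores, key=lambda w: (scores[w], w), reverse=True)[:n]

def word_feats(words, vocab):
    # Only words in the selected vocabulary become features
    return {w: True for w in words if w in vocab}

print(word_feats(["bad", "plot", "slow"], set(top_words)))
# {'plot': True}
```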