machine-learning - Why are the top 10 features almost identical for the positive and negative classes with a Multinomial Naive Bayes classifier?

Question

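After running MultinomialNB several times, I get the same top features for the +ve and -ve classes with both BoW and TfIdf. I even tried bigrams and trigrams, and the features for the two classes are still the same.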

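from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve

best_alpha = 6
clf = MultinomialNB(alpha=best_alpha)
clf.fit(X_tr, y_train)

# batch_predict is the asker's own helper; its definition is not shown in the post
y_train_pred = batch_predict(clf, X_tr)
y_test_pred = batch_predict(clf, X_te)

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)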

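This is the code for getting the top 10 features of the positive and negative classes for the Tf-Idf text data. feats_tfidf contains the feature names for the categorical, numerical, and text data.

For the positive class

# indices of the 10 largest log probabilities for class 1, in ascending order
sorted_idx = np.argsort(clf.feature_log_prob_[1])[-10:]

for p, q in zip(feats_tfidf[sorted_idx], clf.feature_log_prob_[1][sorted_idx]):
    print('{:45}:{}'.format(p, q))

Output: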

Mathematics                                  :-7.134937347073638
Literacy                                     :-6.910334729871051
Grades_3_5                                   :-6.832969821702653
Ms                                           :-6.791634814736902
Math_Science                                 :-6.748584860699069
Grades_PreK_2                                :-6.664767807632341
Literacy_Language                            :-6.4833650280402875
Mrs                                          :-6.404885953106168
Teacher number of previously posted projects :-3.285663623429455
price                                        :-0.09775430166978438

For the negative class

# indices of the 10 largest log probabilities for class 0, in ascending order
sorted_idx = np.argsort(clf.feature_log_prob_[0])[-10:]

for p, q in zip(feats_tfidf[sorted_idx], clf.feature_log_prob_[0][sorted_idx]):
    print('{:45}:{}'.format(p, q))

Output:

Literacy                                     :-7.31906682336635
Mathematics                                  :-7.318545582802034
Grades_3_5                                   :-7.088236519755028
Ms                                           :-6.970453484098645
Math_Science                                 :-6.887189615718408
Grades_PreK_2                                :-6.85882128589294
Literacy_Language                            :-6.8194613665941155
Mrs                                          :-6.648860662073821
Teacher number of previously posted projects :-4.008908256269724
price                                        :-0.08131982830664697

Please help me: is this the correct approach?

score 0 · Accepted Answer

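It should be like this:

sorted_idx = np.argsort(-1 * clf_bow.feature_log_prob_[0])[:10]
for i in sorted_idx:
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
    print(count_vect.get_feature_names()[i])

When you write [-10:], you print the last 10 positions of the ascending sort, so the top features come out smallest first. What we want is to print them largest first: n, n-1, n-2, ..., n-9.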
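To make the ordering point concrete, here is a minimal sketch that uses the positive-class values printed in the question instead of a fitted model. The variable names (`names`, `log_prob`, `k`) are illustrative and do not come from the original post. It suggests that both slicing styles select the same features and differ only in the order in which they are printed.

```python
import numpy as np

# Positive-class values copied from the question's output (already ascending).
names = np.array([
    "Mathematics", "Literacy", "Grades_3_5", "Ms", "Math_Science",
    "Grades_PreK_2", "Literacy_Language", "Mrs",
    "Teacher number of previously posted projects", "price",
])
log_prob = np.array([
    -7.134937347073638, -6.910334729871051, -6.832969821702653,
    -6.791634814736902, -6.748584860699069, -6.664767807632341,
    -6.4833650280402875, -6.404885953106168, -3.285663623429455,
    -0.09775430166978438,
])

k = 3  # illustrative; the question uses 10
ascending_top = np.argsort(log_prob)[-k:]    # top-k indices, smallest of them first
descending_top = np.argsort(-log_prob)[:k]   # top-k indices, largest first

# Same features either way; only the print order differs.
assert set(ascending_top.tolist()) == set(descending_top.tolist())
assert ascending_top[::-1].tolist() == descending_top.tolist()

for name, value in zip(names[descending_top], log_prob[descending_top]):
    print('{:45}:{}'.format(name, value))
```

So switching to the negated argsort changes how the list reads, but by itself it would not be expected to change which features appear for each class.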

score 0 · Accepted Answer

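I am working on the same problem, and yes, I also get many top features that are common to both classes, although not in exactly the same order as yours.

Here is how I did it: I first paired every feature with its probability value (the exponential of the log probability), then sorted the pairs in descending order.

Top 10 positive-class features

Top 10 negative-class features

So yes, I think what you are getting is correct.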
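The pairing approach described in this answer could look roughly like the sketch below. The answerer's actual code is not shown, so the helper `top_features` and its parameters are hypothetical, and the sample values are three of the positive-class entries from the question.

```python
import numpy as np

def top_features(feature_names, log_probs, k=10):
    """Pair each feature with exp(log prob) and return the k largest, largest first.

    Hypothetical helper for illustration; not taken from the original answer.
    """
    probs = np.exp(np.asarray(log_probs, dtype=float))
    ranked = sorted(zip(feature_names, probs), key=lambda pair: pair[1], reverse=True)
    return [(name, float(p)) for name, p in ranked[:k]]

# Three positive-class entries copied from the question's output.
names = ["Mrs", "Teacher number of previously posted projects", "price"]
log_probs = [-6.404885953106168, -3.285663623429455, -0.09775430166978438]

for name, p in top_features(names, log_probs, k=3):
    print('{:45}:{:.6f}'.format(name, p))
```

With a fitted model, one would presumably pass `feats_tfidf` and `clf.feature_log_prob_[1]` (or `[0]` for the negative class) instead of the sample lists. Because the exponential is monotonic, this ranking matches the one obtained from the log probabilities directly. On the question's numbers, `price` alone carries a probability of about 0.907 in the positive class, which is consistent with the same large-valued features leading the list for both classes.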
