I'm trying to compare the accuracy of a Naive Bayes classifier under two feature-selection schemes: TF-IDF and information gain (mutual information).
For TF-IDF I did this:
tfidf_vector = TfidfVectorizer(analyzer=lambda x: x)  # documents are already tokenized
tfidf_vector.fit(Token)
doc_array2 = tfidf_vector.transform(Token).toarray()
frequency_matrix_tfidf = pd.DataFrame(doc_array2, columns=tfidf_vector.get_feature_names_out())
df3 = frequency_matrix_tfidf
df3.insert(len(df3.columns), 'Sentimen', df1['Sentimen'])
This is what the DataFrame looks like before I split it with train_test_split and classify it with MultinomialNB:
ad addict ade adik ah aja ajar ak akses aktif ... warga wkwk wkwkw wkwkwk x ya yaa yg yuk Sentimen
0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
1 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
2 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.251779 0.0 0.0 0.0 0
3 0.0 0.0 0.000000 0.0 0.194082 0.122158 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
4 0.0 0.0 0.239806 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0
But I don't know how to do the same for mutual information, because its output is different from TF-IDF's (shown above): it returns a one-dimensional array:
array([0.        , 0.        , 0.        , 0.05187983, 0.        ,
       0.        , 0.        , 0.        , 0.08828866, 0.        ,
       ...
       0.        , 0.        , 0.        , 0.        , 0.02449425,
       0.        , 0.        ])
So how can I use mutual-information feature selection with Naive Bayes classification?
What I've done so far is use CountVectorizer to get the feature counts:
count_vector = CountVectorizer(analyzer=lambda x: x)  # documents are already tokenized
count_vector.fit(Token)
doc_array = count_vector.transform(Token).toarray()
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
df_gr = frequency_matrix  # turn the df into a bag-of-words matrix
df_gr.insert(len(df_gr.columns), 'Sentimen', df_raw['Sentimen'])
split it:
X_train,X_test,y_train,y_test=train_test_split(df_gr.drop(labels=['Sentimen'], axis=1),
df_gr['Sentimen'],
test_size=0.3,
random_state=0)
and compute mutual_info_classif:
from sklearn.feature_selection import mutual_info_classif
# Compute the information-gain scores with mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
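`mutual_info_classif` returns one score per column of `X_train`, in the same column order, so the array becomes readable once paired with the feature names. A small sketch with toy count data (the column names here are made up):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy bag-of-words counts and labels (stand-ins for the real X_train / y_train)
X_train = pd.DataFrame({"aja":   [1, 0, 2, 0] * 5,
                        "suka":  [0, 3, 0, 1] * 5,
                        "jelek": [2, 0, 1, 0] * 5,
                        "ya":    [0, 1, 0, 2] * 5})
y_train = [0, 1, 0, 1] * 5

# Counts are discrete, so say so explicitly; random_state makes the run repeatable
mi = mutual_info_classif(X_train, y_train, discrete_features=True, random_state=0)

# One score per feature, highest-MI features first
scores = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(scores)
```

Sorting the named scores this way makes it easy to see which terms carry the most information about the label.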
I don't know how to apply this to the Naive Bayes classifier as the next step.
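One common way to wire the MI scores into the classifier (a sketch, not the only option) is scikit-learn's SelectKBest, which keeps the k highest-scoring columns and can sit in a Pipeline in front of MultinomialNB. With synthetic non-negative count data standing in for the real bag-of-words matrix:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy non-negative count matrix (stand-in for the real bag-of-words features)
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 30))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Keep the 10 features with the highest mutual information, then fit NB;
# the selector is fit only on the training split, avoiding test-set leakage
model = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=10),
    MultinomialNB(),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the selection step lives inside the pipeline, `model.fit` scores and filters features using only the training data, and the same k columns are applied automatically at prediction time.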