
I am trying to compare the accuracy of a Naive Bayes classifier under two feature-selection schemes: TF-IDF and information gain (mutual information).

For TF-IDF I did this:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Token is a list of already-tokenized documents, so the analyzer is a no-op
tfidf_vector = TfidfVectorizer(analyzer=lambda x: x)
tfidf_vector.fit(Token)
doc_array2 = tfidf_vector.transform(Token).toarray()
frequency_matrix_tfidf = pd.DataFrame(doc_array2,
                                      columns=tfidf_vector.get_feature_names_out())
df3 = frequency_matrix_tfidf
df3.insert(len(df3.columns), 'Sentimen', df1['Sentimen'])

This is what the DataFrame looks like right before I split it with train_test_split and classify it with MultinomialNB:

ad  addict  ade     adik    ah  aja     ajar    ak  akses   aktif   ...     warga   wkwk    wkwkw   wkwkwk  x   ya  yaa     yg  yuk     Sentimen
0   0.0     0.0     0.000000    0.0     0.000000    0.000000    0.0     0.0     0.0     0.0     ...     0.0     0.0     0.0     0.0     0.0     0.000000    0.0     0.0     0.0     0
1   0.0     0.0     0.000000    0.0     0.000000    0.000000    0.0     0.0     0.0     0.0     ...     0.0     0.0     0.0     0.0     0.0     0.000000    0.0     0.0     0.0     0
2   0.0     0.0     0.000000    0.0     0.000000    0.000000    0.0     0.0     0.0     0.0     ...     0.0     0.0     0.0     0.0     0.0     0.251779    0.0     0.0     0.0     0
3   0.0     0.0     0.000000    0.0     0.194082    0.122158    0.0     0.0     0.0     0.0     ...     0.0     0.0     0.0     0.0     0.0     0.000000    0.0     0.0     0.0     0
4   0.0     0.0     0.239806    0.0     0.000000    0.000000    0.0     0.0     0.0     0.0     ...     0.0     0.0     0.0     0.0     0.0     0.000000    0.0     0.0     0.0     0
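The split-and-classify step mentioned above would look roughly like this. A minimal sketch; the small frequency matrix here is made up to stand in for `df3`, since the real data isn't shown:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy stand-in for the TF-IDF frequency matrix df3 (real data not shown)
df3 = pd.DataFrame({
    'aja': [0.0, 0.0, 0.5, 0.1, 0.0, 0.3, 0.0, 0.2],
    'ya':  [0.2, 0.0, 0.0, 0.4, 0.1, 0.0, 0.3, 0.0],
    'yg':  [0.0, 0.3, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1],
    'Sentimen': [0, 1, 0, 1, 0, 1, 0, 1],
})

# Features are everything except the label column
X_train, X_test, y_train, y_test = train_test_split(
    df3.drop(columns=['Sentimen']), df3['Sentimen'],
    test_size=0.3, random_state=0)

# MultinomialNB accepts the non-negative TF-IDF weights directly
clf = MultinomialNB()
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```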

But I don't know how to do the same for mutual information, because its output is different from TF-IDF's (the example above): it returns a one-dimensional array with one score per feature.

array([0.        , 0.        , 0.        , 0.05187983, 0.        ,
       0.        , 0.        , 0.        , 0.08828866, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.11530285,
       0.        , 0.08021988, 0.05897961, 0.        , 0.06824362,
       0.        , 0.        , 0.02786951, 0.        , 0.05014545,
       0.28257764, 0.        , 0.00984759, 0.        , 0.04362618,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.15165016, 0.01021197, 0.06610714, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.01686709, 0.        , 0.        , 0.18940527, 0.115353  ,
       0.09879918, 0.19144364, 0.064697  , 0.06547344, 0.        ,
       0.        , 0.        , 0.        , 0.47194838, 0.        ,
       0.        , 0.        , 0.10342815, 0.03847181, 0.04500324,
       0.        , 0.25270658, 0.36717759, 0.        , 0.        ,
       0.        , 0.        , 0.04925613, 0.03009996, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.02816479, 0.        , 0.18201676, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.0052056 ,
       0.        , 0.04160016, 0.11510562, 0.        , 0.09763579,
       0.08849817, 0.        , 0.        , 0.        , 0.02353365,
       0.        , 0.        , 0.03959714, 0.        , 0.03214612,
       0.19341475, 0.11260033, 0.        , 0.        , 0.00128479,
       0.        , 0.07341715, 0.00729505, 0.        , 0.1281784 ,
       0.22364735, 0.        , 0.        , 0.3281854 , 0.        ,
       0.        , 0.        , 0.04169775, 0.02608552, 0.02171819,
       0.06591236, 0.        , 0.03454681, 0.        , 0.12895553,
       0.02310305, 0.09715215, 0.12950234, 0.08790128, 0.06153182,
       0.        , 0.        , 0.        , 0.0818714 , 0.05503847,
       0.        , 0.0026008 , 0.12831081, 0.0441718 , 0.2112707 ,
       0.        , 0.        , 0.08382308, 0.02858223, 0.        ,
       0.        , 0.25151498, 0.06671354, 0.        , 0.10150897,
       0.11968319, 0.11681159, 0.        , 0.06950559, 0.05414106,
       0.13507679, 0.02147254, 0.        , 0.09186146, 0.        ,
       0.04002647, 0.12623272, 0.        , 0.        , 0.        ,
       0.        , 0.03283483, 0.01362932, 0.05143286, 0.        ,
       0.12247352, 0.        , 0.        , 0.        , 0.05200576,
       0.        , 0.        , 0.15432282, 0.10984263, 0.        ,
       0.        , 0.1123998 , 0.        , 0.        , 0.15091267,
       0.        , 0.        , 0.07071549, 0.        , 0.08633096,
       0.        , 0.05164792, 0.        , 0.30434291, 0.        ,
       0.10498175, 0.        , 0.08335206, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.1173136 , 0.06708513, 0.        ,
       0.        , 0.        , 0.        , 0.05603758, 0.        ,
       0.07544079, 0.03250152, 0.        , 0.02241439, 0.        ,
       0.        , 0.40283826, 0.05634349, 0.        , 0.        ,
       0.18344998, 0.03347178, 0.        , 0.        , 0.        ,
       0.12222602, 0.04042501, 0.        , 0.        , 0.09945989,
       0.        , 0.        , 0.        , 0.        , 0.34331564,
       0.        , 0.03776299, 0.        , 0.0097911 , 0.13404105,
       0.        , 0.1440933 , 0.        , 0.        , 0.        ,
       0.        , 0.00099355, 0.        , 0.        , 0.        ,
       0.12140984, 0.        , 0.02176973, 0.02654141, 0.        ,
       0.11329586, 0.        , 0.        , 0.        , 0.0614169 ,
       0.05563679, 0.        , 0.14482908, 0.        , 0.04520304,
       0.09473406, 0.04391808, 0.        , 0.13507644, 0.08675567,
       0.        , 0.        , 0.        , 0.        , 0.18475311,
       0.01971689, 0.        , 0.        , 0.11864164, 0.01194291,
       0.01938878, 0.02241326, 0.        , 0.10206998, 0.        ,
       0.        , 0.10903405, 0.        , 0.08198068, 0.        ,
       0.        , 0.00771368, 0.01515531, 0.09689011, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.04403437,
       0.07158526, 0.        , 0.12215691, 0.10023346, 0.04270752,
       0.        , 0.06533   , 0.        , 0.        , 0.13948429,
       0.11520354, 0.        , 0.        , 0.25602588, 0.        ,
       0.13336065, 0.        , 0.09488695, 0.        , 0.23201059,
       0.        , 0.        , 0.        , 0.        , 0.02449425,
       0.        , 0.        ])

So how can I use mutual-information feature selection in Naive Bayes classification?

What I have done so far is use CountVectorizer to get the feature counts:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Token is a list of already-tokenized documents, so the analyzer is a no-op
count_vector = CountVectorizer(analyzer=lambda x: x)
count_vector.fit(Token)
doc_array = count_vector.transform(Token).toarray()
frequency_matrix = pd.DataFrame(doc_array,
                                columns=count_vector.get_feature_names_out())
df_gr = frequency_matrix  # turn the DataFrame into a bag-of-words matrix
df_gr.insert(len(df_gr.columns), 'Sentimen', df_raw['Sentimen'])

split it:

X_train, X_test, y_train, y_test = train_test_split(
    df_gr.drop(labels=['Sentimen'], axis=1),
    df_gr['Sentimen'],
    test_size=0.3,
    random_state=0)

and computed mutual_info_classif:

from sklearn.feature_selection import mutual_info_classif

# Compute the information-gain scores with mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
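Each value in the array returned by `mutual_info_classif` corresponds to one column of `X_train`, so the scores can be paired with the feature names for inspection. A minimal sketch; the counts and column names here are made up to stand in for the real `X_train`/`y_train`:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for the real count matrix (not shown)
X_train = pd.DataFrame({
    'aja': [1, 0, 2, 0, 1, 0, 3, 0],
    'ya':  [0, 2, 0, 1, 0, 2, 0, 1],
    'yg':  [1, 1, 0, 0, 1, 1, 0, 0],
})
y_train = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

mutual_info = mutual_info_classif(X_train, y_train, random_state=0)

# Pair each score with its column name and sort, highest information gain first
scores = pd.Series(mutual_info, index=X_train.columns).sort_values(ascending=False)
print(scores)
```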

I don't know what the next step is to apply this to a Naive Bayes classifier.
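For reference, one common pattern (a sketch, not necessarily the only option) is to wrap `mutual_info_classif` in `SelectKBest`, keep the top-k scoring columns, and fit `MultinomialNB` on the reduced matrices. The data and column names below are made up:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB

# Toy count matrices standing in for the real X_train/X_test (not shown)
rng = np.random.default_rng(0)
cols = ['aja', 'ya', 'yg', 'ah', 'ajar']
X_train = pd.DataFrame(rng.integers(0, 3, size=(20, 5)), columns=cols)
y_train = pd.Series(rng.integers(0, 2, size=20))
X_test = pd.DataFrame(rng.integers(0, 3, size=(6, 5)), columns=cols)

# Keep the 3 features with the highest mutual information with the label;
# the lambda pins random_state so the selection is reproducible
selector = SelectKBest(
    score_func=lambda X, y: mutual_info_classif(X, y, random_state=0), k=3)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)  # apply the same columns to the test set

print(X_train.columns[selector.get_support()])  # which features survived

clf = MultinomialNB().fit(X_train_sel, y_train)
pred = clf.predict(X_test_sel)
```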
