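I have implemented a decision tree (DT) classifier with cross-validation (CV) in scikit-learn. However, I would also like to output the number of features that contribute to the classification. This is my code so far: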
from collections import defaultdict

import numpy as np
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation
from sklearn.tree import DecisionTreeClassifier
from scipy.sparse import csr_matrix

lemma2feat = defaultdict(lambda: defaultdict(float))  # {lemma: {feat: weight}}
lemma2cat = dict()
features = set()

# Each input line holds: lemma, feature, weight, class (whitespace-separated)
with open("input.csv", "r") as infile:
    for line in infile:
        lemma, feature, weight, tClass = line.split()
        lemma2feat[lemma][feature] = float(weight)
        lemma2cat[lemma] = int(tClass)
        features.add(feature)

sorted_rows = sorted(lemma2feat.keys())
col2index = dict()
for colIdx, col in enumerate(sorted(list(features))):
    col2index[col] = colIdx

dMat = np.zeros((len(sorted_rows), len(col2index.keys())), dtype=float)

# populate the matrix
for vIdx, vector in enumerate(sorted_rows):
    for feature in lemma2feat[vector].keys():
        dMat[vIdx][col2index[feature]] = lemma2feat[vector][feature]

# class labels, in the same order as the rows of dMat
res = []
for lem in sorted_rows:
    res.append(lemma2cat[lem])

clf = DecisionTreeClassifier(random_state=0)

print("Acc:")
print(cross_val_score(clf, dMat, np.asarray(res), cv=10, scoring="accuracy"))
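What can I add to output the number of features? For example, I looked at RFE, as I asked in another question, but it cannot easily be combined with a DT. So I would like to know whether there is a way to modify my code above to output the number of features that contribute to the highest accuracy. The overall goal is to then compare this with the output of other classifiers. Thank you.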
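To make the kind of output I mean more concrete, here is a minimal sketch. It assumes that "contributing features" can be read as the features with a non-zero entry in the fitted tree's `feature_importances_`, which is only one possible interpretation. The synthetic data from `make_classification` is a hypothetical stand-in for `dMat` and `res`, not my real input.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for dMat / res (not the real input.csv data)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

clf = DecisionTreeClassifier(random_state=0)

# cross_val_score fits clones of clf, so clf itself stays unfitted here
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())

# Fit on all data to inspect which features the tree actually splits on
clf.fit(X, y)
used = np.flatnonzero(clf.feature_importances_ > 0)
print("Features used by the tree:", len(used), "of", X.shape[1])
```

This counts the features used by a single tree fitted on all the data, so it says nothing yet about which number of features gives the highest cross-validated accuracy.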