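I want to plot "Recursive feature elimination with cross-validation" in scikit-learn using a decision tree and kNN, as described here.
I want to implement this in the classifiers I am already using, so that both results are output at the same time. However, it keeps giving me an error.
Here is the code I modified for the DT: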
from collections import defaultdict
import numpy as np
from sklearn.cross_validation import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import zero_one_loss
from scipy.sparse import csr_matrix
lemma2feat = defaultdict(lambda: defaultdict(float)) # { lemma: {feat : weight}}
lemma2cat = dict()
features = set()
with open("input.csv","rb") as infile:
    for line in infile:
        lemma, feature, weight, tClass = line.split()
        lemma2feat[lemma][feature] = float(weight)
        lemma2cat[lemma] = int(tClass)
        features.add(feature)
sorted_rows = sorted(lemma2feat.keys())
col2index = dict()
for colIdx, col in enumerate(sorted(list(features))):
    col2index[col] = colIdx
dMat = np.zeros((len(sorted_rows), len(col2index.keys())), dtype = float)
# populate matrix
for vIdx, vector in enumerate(sorted_rows):
    for feature in lemma2feat[vector].keys():
        dMat[vIdx][col2index[feature]] = lemma2feat[vector][feature]
# sort targ. results.
res = []
for lem in sorted_rows:
    res.append(lemma2cat[lem])
clf = DecisionTreeClassifier(random_state=0)
rfecv = RFECV(estimator=DecisionTreeClassifier, step1, cv=10,
scoring='accuracy')
rfecv.fit(dMat)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (nb of misclassifications)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()
print "Acc:"
print cross_val_score(clf, dMat, np.asarray(res), cv=10, scoring = "accuracy")
The error starts at line 56, more specifically:
rfecv = RFECV(estimator=DecisionTreeClassifier, step1, cv=10,
SyntaxError: non-keyword arg after keyword arg
Can anyone offer some insight into how to correct my code so that this works at least with the DT?
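For reference, the SyntaxError can be reproduced without scikit-learn at all: `step1` is parsed as a positional argument, and it comes after the keyword argument `estimator=...`. The sketch below only checks whether two call strings parse. It assumes `step=1` was the intended argument and that the already constructed `clf` instance, rather than the `DecisionTreeClassifier` class, is what should be passed as the estimator. The helper `parses` is a hypothetical name introduced for illustration.

```python
def parses(call_source):
    """Return True if the source string is a syntactically valid expression."""
    try:
        # mode="eval" only parses and compiles the text; nothing is executed,
        # so the names RFECV and clf do not need to exist here.
        compile(call_source, "<call>", "eval")
        return True
    except SyntaxError:
        return False


# The call as written in the question: `step1` is positional and follows a keyword.
original_call = "RFECV(estimator=DecisionTreeClassifier, step1, cv=10, scoring='accuracy')"

# Assumed intent: `step=1`, with the classifier instance `clf` as the estimator.
corrected_call = "RFECV(estimator=clf, step=1, cv=10, scoring='accuracy')"

print(parses(original_call))   # False
print(parses(corrected_call))  # True
```

Note that `rfecv.fit(dMat)` in the code above would likely fail next, since `fit` expects both the data matrix and the target labels.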
The answer from ogrisel below seems to have fixed the problem with the arguments, but it now raises the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
    execfile(filename, namespace)
  File "input.py", line 58, in <module>
    rfecv.fit(col_index, rows)
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 321, in fit
    X, y = check_arrays(X, y, sparse_format="csr")
  File "/anaconda/python.app/Contents/lib/python2.7/site-packages/sklearn/utils/validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 267. Expected 16
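It seems that RFE is reading my input in the opposite orientation (my input contains 16 features and 267 targets). Given that, how do I feed the dimensions into the code correctly?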
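As a point of comparison, scikit-learn estimators expect `X` with shape `(n_samples, n_features)` and `y` with one label per sample, so the first dimension of `X` must equal the length of `y`. The sketch below uses synthetic data from `make_classification` as a stand-in for the real input (267 samples, 16 features), and it is written against a recent scikit-learn release rather than the version in the traceback. In the question's code the corresponding arguments would presumably be `dMat` and `np.asarray(res)`, which is an assumption based on how those variables are built.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 267 samples (rows), 16 features (columns).
X, y = make_classification(n_samples=267, n_features=16,
                           n_informative=4, random_state=0)

# Simulate a matrix that was stored features-by-samples, i.e. shape (16, 267).
X_wrong = X.T

# Transpose back if the first dimension does not match the number of labels.
X_fixed = X_wrong.T if X_wrong.shape[0] != len(y) else X_wrong
print(X_fixed.shape, np.asarray(y).shape)

rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=0),
              step=1, cv=10, scoring="accuracy")
rfecv.fit(X_fixed, y)  # X: (n_samples, n_features), y: (n_samples,)
print("Optimal number of features : %d" % rfecv.n_features_)
```

The boolean mask `rfecv.support_` then has one entry per feature column, which is a quick way to confirm that the 16 columns were treated as features.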
Thank you.