8

我正在尝试使用卡方(scikit-learn 0.10)选择最佳功能。从总共 80 个训练文档中,我首先提取 227 个特征,然后从这 227 个特征中选择前 10 个。

my_vectorizer = CountVectorizer(analyzer=MyAnalyzer())      
X_train = my_vectorizer.fit_transform(train_data)
X_test = my_vectorizer.transform(test_data)
Y_train = np.array(train_labels)
Y_test = np.array(test_labels)
X_train = np.clip(X_train.toarray(), 0, 1)
X_test = np.clip(X_test.toarray(), 0, 1)    
ch2 = SelectKBest(chi2, k=10)
print X_train.shape
X_train = ch2.fit_transform(X_train, Y_train)
print X_train.shape

结果如下。

(80, 227)
(80, 14)

k如果我设置为 ,它们是相似的100

(80, 227)
(80, 227)

为什么会这样?

*编辑:一个完整​​的输出示例,现在没有剪辑,我请求 30 并得到 32 代替:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 32)
Using 32(requested:30) best features from 9 training documents
get support:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True]
get support with vocabulary :
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...

另一个没有剪裁的例子,我请求 10 并得到 11:

Train instances: 9 Test instances: 1
Feature extraction...
X_train:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0]
 [0 0 2 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 1]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0]]
Y_train:
[0 0 0 0 0 0 0 0 1]
32 features extracted from 9 training documents.
Feature selection...
(9, 32)
(9, 11)
Using 11(requested:10) best features from 9 training documents
get support:
[ True  True  True False False  True False False False False  True False
 False False  True False False False  True False  True False  True  True
 False False False False  True False False False]
get support with vocabulary :
[ 0  1  2  5 10 14 18 20 22 23 28]
Training...
/usr/local/lib/python2.6/dist-packages/scikit_learn-0.10-py2.6-linux-x86_64.egg/sklearn/svm/sparse/base.py:23: FutureWarning: SVM: scale_C will be True by default in scikit-learn 0.11
  scale_C)
Classifying...
4

1 回答 1

5

您是否检查过get_support()函数返回的内容(ch2应该有这个成员函数)?这将返回在最佳 k 中选择的索引。

我的猜想是由于您正在执行的数据裁剪(或者由于重复的特征向量,如果您的特征向量是分类的并且可能有重复)而存在联系,并且 scikits 函数返回所有被绑定的条目为前 k 个点。您设置的额外示例k = 100对这个猜想产生了一些疑问,但值得一看。

查看get_support()返回的内容,并检查X_train这些索引的外观,查看裁剪是否会导致大量特征重叠,从而在使用的 chi^2 p 值等级中创建联系SelectKBest

如果事实证明是这种情况,您应该向 scikits.learn 提交错误/问题,因为目前他们的文档没有说明SelectKBest在出现平局时会做什么。显然,它不能只取一些绑定索引而不取其他索引,但至少应该警告用户,绑定可能导致意外的特征维度降低。

于 2012-04-30T04:57:07.437 回答