I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and accuracy scores on a binary classification task.
During training, I used a scikit-learn CountVectorizer, cv, to reduce the number of features:
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
I then used the fit_transform() and transform() methods to obtain the transformed training and test sets:
transformed_feat_train = numpy.zeros((0,0,))
transformed_feat_test = numpy.zeros((0,0,))
transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
transformed_feat_test = cv.transform(testingTextFeat).toarray()
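To make the difference between the two calls concrete, here is a minimal sketch on toy data. The names train_docs and test_docs and the small max_features=5 are assumptions chosen for illustration, not my real data. fit_transform() learns the vocabulary from the training text, while transform() only reuses it, so both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for trainingTextFeat / testingTextFeat.
train_docs = ["good movie", "bad movie", "good plot", "bad acting"]
test_docs = ["good acting", "terrible movie"]

# Same settings as above, but with a tiny max_features so the cap is visible.
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=5)

# fit_transform() learns the vocabulary from the training text and encodes it.
X_train = cv.fit_transform(train_docs).toarray()

# transform() reuses that learned vocabulary; nothing is re-learned here.
X_test = cv.transform(test_docs).toarray()

# Both matrices share the same number of columns (the vocabulary size).
print(X_train.shape, X_test.shape)
```

Words that appear only in the test documents (such as "terrible" here) are ignored by transform(), because they are not part of the fitted vocabulary.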
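This all works fine for training and testing the classifier. However, I am not sure how to use fit_transform() and transform() with the pickled version of the trained classifier to predict labels for unseen, unlabeled data.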
I am extracting features from the unlabeled data in exactly the same way as when training/testing the classifier:
## load the pickled classifier for labeling
pickledClassifier = joblib.load(pickledClassifierFile)
## transform data
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
cv.fit_transform(NOT_SURE)
transformed_feat_unlabeled = numpy.zeros((0,0,))
transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()
## predict label on unseen, unlabeled data
l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)
Error message:
Traceback (most recent call last):
File "../clf.py", line 615, in <module>
if __name__=="__main__": main()
File "../clf.py", line 579, in main
cv.fit_transform(pickledClassifierFile)
File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
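A note on what the traceback suggests: fit_transform() was called with pickledClassifierFile, which is a file path string rather than a collection of documents. In the scikit-learn version shown in the traceback, the string is most likely iterated character by character, and since the default token pattern only keeps tokens of two or more characters, no tokens survive and the vocabulary comes out empty. More generally, a freshly constructed CountVectorizer has no vocabulary, so its columns would not match the ones the classifier was trained on.

One common way around this, sketched below under stated assumptions, is to persist the fitted vectorizer together with the classifier and call only transform() at prediction time. The toy data, the MultinomialNB classifier, and the file names vectorizer.pkl and classifier.pkl are all placeholders for illustration; my real classifier and paths are different.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; the real corpus and classifier are not shown in the post.
train_docs = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]
unlabeled_docs = ["great movie", "awful acting"]

# Training time: fit the vectorizer once and train the classifier on its output.
cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)
clf = MultinomialNB().fit(cv.fit_transform(train_docs), train_labels)

# Persist BOTH objects; the directory and file names are placeholders.
out_dir = tempfile.mkdtemp()
joblib.dump(cv, os.path.join(out_dir, "vectorizer.pkl"))
joblib.dump(clf, os.path.join(out_dir, "classifier.pkl"))

# Prediction time: load both, then call transform() only (no refitting).
loaded_cv = joblib.load(os.path.join(out_dir, "vectorizer.pkl"))
loaded_clf = joblib.load(os.path.join(out_dir, "classifier.pkl"))
X_unlabeled = loaded_cv.transform(unlabeled_docs)
predicted = loaded_clf.predict(X_unlabeled)
print(predicted)
```

Because the loaded vectorizer carries the vocabulary learned during training, the unlabeled matrix has exactly the columns the classifier expects. Wrapping both steps in a single sklearn Pipeline and pickling that one object would be an alternative with the same effect.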