python - CountVectorizer: transform method returns a multidimensional array for a single line of text

Question

First, I fit it on an SMS corpus:

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
X_desc = clf.fit_transform(X).toarray()

It seems to work fine:

X.shape = (5574,)
X_desc.shape = (5574, 8713)

But then I applied the transform method to a single line of text. As far as I know, the result should have shape (, 8713), but this is what we see:

str2 = 'Have you visited the last lecture on physics?'
print(len(str2), clf.transform(str2).toarray().shape)

52 (52, 8713)

What is going on here? One more thing: all the numbers are zeros.

score 4 · Accepted Answer

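You always need to pass an array or list of documents to transform. If you only want to transform a single element, pass a one-element list and then extract its contents: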

clf.transform([str2])[0]

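By the way, the reason you get a two-dimensional array with one row per character is that a string is itself an iterable sequence of characters, so the vectorizer treats your string as a collection in which each character is a separate document.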
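This also accounts for the all-zero rows. The sketch below is a simplified stand-in for the vectorizer, not scikit-learn itself: `tokenize_documents` is a hypothetical helper that only mimics the tokenization step, assuming the default `token_pattern` of `CountVectorizer` (tokens of two or more word characters) and default lowercasing. A single character can never match that pattern, so each per-character "document" contributes no counts.

```python
import re

# Assumed default CountVectorizer token pattern: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize_documents(raw_documents):
    """Hypothetical helper: treat each item of the iterable as one document
    and return its lowercased tokens."""
    return [TOKEN_PATTERN.findall(doc.lower()) for doc in raw_documents]


str2 = 'Have you visited the last lecture on physics?'

# Bare string: iterating over it yields one "document" per character.
per_char = tokenize_documents(str2)
print(len(per_char) == len(str2))  # one row per character
print(any(per_char))               # no single character forms a token

# One-element list: exactly one document, with real tokens.
wrapped = tokenize_documents([str2])
print(len(wrapped))
print(wrapped[0])
```

Wrapping the string in a list is what makes the vectorizer see one document instead of one document per character.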
