python-2.7 - Reading documents from files when using sklearn.feature_extraction.text CountVectorizer

Question

I can use the code from the example in the documentation, where the input to fit_transform() is a list of sentences, i.e.:

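from sklearn.feature_extraction.text import CountVectorizer

corpus = [
   'this is the first document',
   'this is the second second document',
   'and the third one',
   'is this the first document?'
]

# default settings: each item in corpus is treated as the document text itself
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)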

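and get the expected data. However, when I try to replace the corpus with a list of files or file objects, which the documentation says is allowed:

" fit(raw_documents, y=None)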

Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters :    
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.
Returns :   
self :

"

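...so I think I am missing something in my understanding of the pipeline. Given a directory of files that I want to CountVectorize, how do I do it? If I try to supply a list of file objects, such as [open(file,'r')], the error message I get is that the file object has no lower function.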

score 5 · Accepted Answer

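Set the vectorizer's input constructor parameter to 'filename' or 'file'. Its default value is 'content', which assumes you have already read the files into memory. That is why passing file objects with the default setting fails: the vectorizer treats each item as a string and tries to lowercase it.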
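As a minimal sketch of both options (not taken from the original answer): the temporary directory, the file names a.txt, b.txt, c.txt and their contents are made up here so the snippet runs on its own. In practice you would build the list of paths from your own directory.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical setup: write a few small text files into a temporary
# directory so the example is self-contained.
doc_dir = tempfile.mkdtemp()
texts = {
    "a.txt": "this is the first document",
    "b.txt": "this is the second second document",
    "c.txt": "and the third one",
}
for name, text in texts.items():
    with open(os.path.join(doc_dir, name), "w") as fh:
        fh.write(text)

# Sort the paths so the row order of X is predictable.
paths = sorted(os.path.join(doc_dir, name) for name in os.listdir(doc_dir))

# input='filename': the vectorizer opens and reads each path itself.
vectorizer = CountVectorizer(input="filename")
X = vectorizer.fit_transform(paths)

# input='file': the vectorizer calls read() on each open file object.
handles = [open(p, "r") for p in paths]
try:
    X_from_handles = CountVectorizer(input="file").fit_transform(handles)
finally:
    for fh in handles:
        fh.close()
```

With input='filename' you never hold open file handles yourself, which is usually the more convenient choice for a whole directory. Both calls should produce the same document-term matrix for the same files.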
