Try preprocessor instead of tokenizer.
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
If x in the error message above is a list, then calling x.lower() on it will throw that error, since lists have no lower() method.
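For context, here is a minimal sketch (assuming scikit-learn is installed) that reproduces the error by passing pre-tokenized sentences to a TfidfVectorizer with its default preprocessor:

from sklearn.feature_extraction.text import TfidfVectorizer

# Each document is a list of tokens rather than a string.
tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

# The default preprocessor calls .lower() on each document, which fails
# for lists: AttributeError: 'list' object has no attribute 'lower'
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit_transform(tokenized_sentences)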
Both of your examples consist entirely of stop words, so for this example to return anything, throw in a few random words. Here's an example:
tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]
tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)
This returns:
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
The features:
>>> tfidf.get_feature_names()
['cat', 'dog']
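To inspect the actual tf-idf weights rather than just the feature names, the sparse result can be densified. A small sketch; the exact numbers depend on scikit-learn's default smoothing and normalization, and on newer versions get_feature_names() is replaced by get_feature_names_out():

# One row per sentence, one column per feature ('cat', 'dog').
matrix = tfidf.fit_transform(tokenized_sentences)
print(matrix.toarray())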
Update: maybe use lambdas for both the tokenizer and the preprocessor?
tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]
tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)
<2x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']
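One side note on the lambdas: the standard pickle module cannot serialize them, so if the fitted vectorizer needs to be saved, a named identity function is a drop-in alternative. A sketch of the same idea:

from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # Pass the pre-tokenized document through unchanged.
    return doc

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=identity, preprocessor=identity,
                        stop_words='english')
tfidf.fit_transform(tokenized_sentences)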