python - How do I use a list of lists or a list of sets with TfidfVectorizer?

Question

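I am using sklearn's TfidfVectorizer for text classification.

I know this vectorizer expects raw text as input, but passing a list of strings works (see input1).

However, if I pass a list of lists (or a list of sets), I get the AttributeError shown below.

Does anyone know how to solve this? Thanks in advance!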

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
    input1 = ["This", "is", "a", "test"]
    input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

    print(vectorizer.fit_transform(input1))  # works
    print(vectorizer.fit_transform(input2))  # raises AttributeError

input 1:
  (3, 0)    1.0

input 2:

    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
        self.fixed_vocabulary_)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
        for feature in analyze(doc):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'

score 6 · Accepted Answer

Note that input1 works, but each element (string) of the list is treated as a separate document to vectorize.
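A minimal sketch of that behaviour, reusing the vectorizer settings from the question. The names `matrix`, `n_docs` and `n_terms` are only illustrative. It checks the shape of the result: one row per list element, and a single surviving term, because "This" and "is" are English stop words and "a" is too short for the default token pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]

matrix = vectorizer.fit_transform(input1)

# One row per list element: every string counts as its own document.
n_docs, n_terms = matrix.shape
print(n_docs, n_terms)                 # 4 1

# Only "test" survives stop-word removal and the default token pattern.
print(sorted(vectorizer.vocabulary_))  # ['test']
```

This matches the single entry `(3, 0)    1.0` in the question's output: the only non-zero value sits in row 3, the document "test".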

For input2, I assume you want to vectorize each "sentence" (sublist). One solution is the following list comprehension:

    input2_corrected = [" ".join(x) for x in input2]

which yields

    ['This is a test', 'It is raining today']

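This no longer raises the AttributeError, because each document is now a plain string that the vectorizer can lowercase and tokenize.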
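Put together, the fix might look like the sketch below. The set-based variant is an assumption added here to cover the "list of sets" part of the question: `input_sets`, `joined_sets` and the use of `sorted` are illustrative choices, not part of the original answer. Sorting only makes the joined string deterministic, since sets are unordered and the default unigram features ignore word order anyway.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# Join each token list back into one string per document.
input2_corrected = [" ".join(tokens) for tokens in input2]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
matrix = vectorizer.fit_transform(input2_corrected)

print(matrix.shape)                    # (2, 3)
print(sorted(vectorizer.vocabulary_))  # ['raining', 'test', 'today']

# Hypothetical variant for a list of sets: sort first so the joined
# string does not depend on the arbitrary iteration order of a set.
input_sets = [{"This", "is", "a", "test"}, {"It", "is", "raining", "today"}]
joined_sets = [" ".join(sorted(tokens)) for tokens in input_sets]

matrix_sets = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(joined_sets)
print(matrix_sets.shape)               # (2, 3)
```

If the documents are already tokenized and should not be re-tokenized, another option is to pass a callable as the `analyzer` parameter, which then receives each document unchanged. In that case the vectorizer's lowercasing and stop-word filtering are not applied.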
