python - Natural language processing using TfidfVectorizer

Question

from sklearn.feature_extraction.text import TfidfVectorizer
filename='train1.txt'
dataset=[]
with open(filename) as f:
    for line in f:
        dataset.append([str(n) for n in line.strip().split(',')])
print (dataset)
tfidf=TfidfVectorizer()
tfidf.fit(dataset)
dict1=tfidf.vocabulary_
print 'Using tfidfVectorizer'
for key in dict1.keys():
    print key+" "+ str(dict1[key])

I am reading strings from the file train1.txt. However, the statement tfidf.fit(dataset) raises an error that I have not been able to fix. Any help is appreciated.

Error log:

Traceback (most recent call last):
  File "Q1.py", line 52, in <module>
    tfidf.fit(dataset)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1361, in fit
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/opt/anaconda2/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

score 1 · Accepted Answer

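According to the TfidfVectorizer documentation, fit expects "an iterable which yields either str, unicode or file objects" as its first argument. You are passing it a list of lists, which does not meet this requirement.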

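You have turned each line into a list of strings with the split method, so you need to either join the strings back together or avoid splitting the line at all. Which option is right depends on your input format.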
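To make the mismatch concrete, here is a minimal sketch using only the standard library. The sample line "good,movie" is an assumption, since the actual contents of train1.txt are not shown. The last frame of the traceback shows the vectorizer calling .lower() on each document, which a list does not have.

```python
# What the question's loop builds for one line of the file.
# "good,movie\n" is a hypothetical sample line, not taken from train1.txt.
row = [str(n) for n in "good,movie\n".strip().split(',')]
print(row)  # ['good', 'movie']

# The traceback ends in x.lower(), so every document must be a string.
print(hasattr(row, "lower"))            # False: a list is not a document
print(hasattr(" ".join(row), "lower"))  # True: a joined string is
```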

It should work if you modify this line:

dataset.append([str(n) for n in line.strip().split(',')])

Depending on your input format, you may need to replace it with something like

dataset.append(" ".join([str(n) for n in line.strip().split(',')]))

or simply

dataset.append(line.strip().replace(",", " "))

(I can only guess how "," is used in your input text.)
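Putting it together, a rough sketch of the corrected flow might look like the following. It assumes each line of the file is one document with comma-separated words. The helper names lines_to_documents and fit_vocabulary and the sample lines are made up for illustration. It uses print as a function, so it targets Python 3.

```python
def lines_to_documents(lines, sep=","):
    """Turn each separated line into one space-joined string (one document)."""
    documents = []
    for line in lines:
        fields = [field.strip() for field in line.strip().split(sep)]
        # Drop empty fields so stray separators do not leave extra spaces.
        documents.append(" ".join(field for field in fields if field))
    return documents


def fit_vocabulary(documents):
    """Fit a TfidfVectorizer and return its term -> column index mapping."""
    # Imported here so lines_to_documents can be used without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    tfidf.fit(documents)  # documents is a list of str, as fit() expects
    return tfidf.vocabulary_


# Hypothetical stand-in for the contents of train1.txt.
sample_lines = ["good,movie\n", "bad, movie\n"]
documents = lines_to_documents(sample_lines)
print(documents)  # ['good movie', 'bad movie']
```

With scikit-learn installed, fit_vocabulary(documents) should return the same kind of dictionary as tfidf.vocabulary_ in the question. In the original script, the open file object f could be passed directly to lines_to_documents in place of sample_lines.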
