scikit-learn - Using sklearn TfidfVectorizer with already tokenized input?

Question

I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
  return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)

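which fails with

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized in an earlier stage.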

score 22 · Accepted Answer

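Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually what you want, since you have already lowercased your tokens in an earlier stage).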

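from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    return text

# lowercase=False skips the default lowercasing step, which would call .lower() on each list
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)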

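Note that I changed the sentences, since the originals apparently contained only stop words, which causes a different error because the vocabulary ends up empty.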

score 2 · Accepted Answer

Try preprocessor instead of tokenizer.

    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

If x in the error message above is a list, then calling x.lower() on it raises this error, because lists have no lower() method.
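
To make the failure concrete, here is a minimal pure-Python sketch of the order in which a word-level vectorizer processes a document: preprocess first, then tokenize, then drop stop words. The names default_preprocess and sketch_analyze are hypothetical stand-ins written for illustration, not scikit-learn functions, and the real pipeline does more (decoding, accent stripping, n-grams).

```python
def default_preprocess(doc):
    # Simplified stand-in for the default preprocessing when lowercase=True
    return doc.lower()

def sketch_analyze(doc, preprocess, tokenize, stop_words=frozenset()):
    """Hypothetical, simplified pipeline: preprocess -> tokenize -> filter stop words."""
    tokens = tokenize(preprocess(doc))
    return [tok for tok in tokens if tok not in stop_words]

tokens = ['this', 'is', 'one', 'cat']

# Default preprocessing applied to a list reproduces the error from the question
try:
    sketch_analyze(tokens, default_preprocess, lambda x: x)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Identity preprocessing and tokenizing let the pre-tokenized list pass through
result = sketch_analyze(tokens, lambda x: x, lambda x: x,
                        stop_words={'this', 'is', 'one'})
print(result)
```

This is why the fix has to replace or disable the preprocessing step and not only the tokenizer: the preprocessing runs before the tokenizer ever sees the document.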

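Both of your example sentences consist entirely of stop words, so to make this example return something, throw in a few other words. Here's an example:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

# Join each token list back into a string; the default tokenizer then splits it again
tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)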

Returns:

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

Features:

>>> tfidf.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() there
['cat', 'dog']

Update: maybe use lambdas for both the tokenizer and the preprocessor?

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']

score 0 · Accepted Answer

As @Jarad said, just use a "pass-through" function for your analyzer, but it needs to ignore stop words. You can get stop words from sklearn:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

or from nltk:

>>> import nltk
>>> nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))

or combine both sets:

stop_words = stop_words.union(ENGLISH_STOP_WORDS)

But your example contains only stop words (all of your words are in sklearn's ENGLISH_STOP_WORDS set).

Nonetheless, @Jarad's example works:

>>> tokenized_list_of_sentences =  [
...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
...     ['this', 'is', 'another', 'dog']]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
>>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)
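
If you would rather not download the nltk corpus, a self-contained variant of the same idea might look like the sketch below. It assumes that sklearn's ENGLISH_STOP_WORDS alone is good enough for your data; drop_stop_words is a name I made up for illustration. Using a named function instead of a lambda should also make the fitted vectorizer easier to pickle.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

tokenized_docs = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                  ['this', 'is', 'another', 'dog']]

def drop_stop_words(tokens):
    """Pass the token list through unchanged, minus sklearn's English stop words."""
    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]

# A callable analyzer replaces preprocessing, tokenizing and stop-word filtering
tfidf = TfidfVectorizer(analyzer=drop_stop_words)
vectors = tfidf.fit_transform(tokenized_docs)

# vocabulary_ maps each term to its column index; sort by that index
features = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
print(features)
print(vectors.shape)
```

Since the analyzer is a callable, nothing ever calls .lower() on the lists, so lowercase=False is not needed here.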

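I like pd.DataFrames for browsing TF-IDF vectors:

>>> import pandas as pd
>>> # vocabulary_ maps term -> column index, so order the column labels by that index
>>> columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
>>> pd.DataFrame(tfidf_vectors.toarray(), columns=columns)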
        cat       dog 
0  0.814802  0.579739
1  0.000000  1.000000
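
In case it helps to see where those numbers come from, here is a rough pure-Python recomputation. It assumes TfidfVectorizer's default settings (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); it is a sketch of the arithmetic, not the library's implementation.

```python
import math

# The two sentences after stop-word removal
docs = [['cat', 'dog'], ['dog']]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency and smoothed idf for every term
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

rows = []
for doc in docs:
    # Raw term count times idf, then scale the row to unit L2 length
    weights = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

for row in rows:
    print([round(value, 6) for value in row])
```

'dog' occurs in both sentences, so its idf is exactly 1, while 'cat' occurs in only one and is weighted up. That is why 'cat' dominates the first row.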
