python - Applying TfidfVectorizer to each row of a DataFrame where each row is a list of lists

Question

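I have a pandas DataFrame with 2 columns, and I want to use sklearn's TfidfVectorizer on one of them for text classification. However, this column is a list of lists, and TF-IDF expects raw text as input. In this question, a solution is provided for the case of a single list of lists, but I would like to ask how to apply this function to each row of my DataFrame, where each row contains a list of lists. Thank you in advance.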

Input:

0    [[this, is, the], [first, row], [of, dataframe]]
1    [[that, is, the], [second], [row, of, dataframe]]
2    [[etc], [etc, etc]]

Desired output:

0    ['this is the', 'first row', 'of dataframe']
1    ['that is the', 'second', 'row of dataframe']
2    ['etc', 'etc etc']

score 0 · Accepted Answer

You can use apply:

import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
                        [[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
                  columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])

Output

0     [this is the, first row, of dataframe]
1    [that is the, second, row of dataframe]
Name: result, dtype: object
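
If a classifier needs one document per row rather than one string per sub-list, the same idea extends to flattening every token in a row into a single string. The following is a minimal sketch under that assumption; the helper name `join_row` and the column name `document` are made up for illustration and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'paragraphs': [
    [['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']],
    [['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']],
    [['etc'], ['etc', 'etc']],
]})


def join_row(paragraphs):
    """Flatten a list of token lists into one space-separated string."""
    return ' '.join(token for tokens in paragraphs for token in tokens)


# Hypothetical column name: one raw-text document per DataFrame row
df['document'] = df['paragraphs'].apply(join_row)
print(df['document'].tolist())
```

Whether to keep the sub-lists separate or merge them depends on what counts as a "document" for the downstream model.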

Additionally, if you want to combine the vectorizer with the function above, you can do the following:

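from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
    # Join each inner token list into one string, then fit TF-IDF on this row only
    text = [' '.join(x) for x in xs]
    return vectorizer.fit_transform(text)


# Each cell of 'vectors' holds a sparse matrix fitted on that row alone
df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)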
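
Note that calling `fit_transform` separately for each row gives every row its own vocabulary, so the resulting matrices are not directly comparable across rows. For text classification, one possible alternative is to fit a single vectorizer on all sentences first and then only `transform` each row. This is a sketch of that variation, not part of the original answer; the `sentences` column and the `vectors` list are illustrative names, and the default `TfidfVectorizer()` settings are an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'paragraphs': [
    [['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']],
    [['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']],
]})

# One string per inner list, as in the answer above
df['sentences'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])

# Fit once on every sentence from every row so all rows share one vocabulary
all_sentences = [s for row in df['sentences'] for s in row]
vectorizer = TfidfVectorizer()
vectorizer.fit(all_sentences)

# Transform each row with the already-fitted vectorizer (no refitting)
vectors = [vectorizer.transform(row) for row in df['sentences']]

print(sorted(vectorizer.vocabulary_))
print([v.shape for v in vectors])
```

With a shared vocabulary, every matrix has the same number of columns, which is what most scikit-learn estimators expect when rows are stacked or compared.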
