python - Storing a Tf-idf matrix and updating the existing matrix for new articles in pandas

Question

I have a pandas dataframe whose text column consists of news articles. It looks like this:

text
article1
article2
article3
article4

I compute the TF-IDF values of the articles as:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(df['text'])

My dataframe is updated from time to time. Suppose that, after computing the TF-IDF as matrix_1, more articles are added to the dataframe, something like:

text
article1
article2
article3
article4
article5
article6
article7

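Because I have millions of articles, I want to store the TF-IDF matrix of the earlier articles and update it with the TF-IDF scores of the new articles. Running the TF-IDF code over all articles again and again would consume too much memory. Is there any way to do this?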

score 0 · Accepted Answer

I have not tested this code, but I think it should work.

import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame()  # gets a 'text' column once articles arrive
tfidf = None         # fitted vectorizer, created on the first batch
matrix_1 = None      # accumulated TF-IDF matrix (sparse)
last_len = 0         # number of rows already processed
while True:
    if len(df) > last_len:
        new_text = df['text'].iloc[last_len:]
        if tfidf is None:
            # When your dataframe is populated for the very first time
            tfidf = TfidfVectorizer()
            matrix_1 = tfidf.fit_transform(new_text)
        else:
            # When your dataframe is populated again and again:
            # reuse the earlier fitted model and append the new rows.
            # The matrices are sparse, so stack with scipy.sparse.vstack, not np.vstack.
            matrix_1 = vstack([matrix_1, tfidf.transform(new_text)])
            # Calling tfidf.fit_transform(new_text) here instead doesn't really make sense:
            # it learns a new vocabulary, so the columns no longer line up with matrix_1.
        last_len = len(df)

    # TO-DO Some break condition according to your case
    #####

If a long time passes between updates to the dataframe, you can use pickle on matrix_1 to store the intermediate result.
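A minimal sketch of that idea, not tested against your data. It assumes you also pickle the fitted vectorizer, since it is needed to transform later batches into the same columns. The sample sentences, the file name tfidf_state.pkl and the dictionary keys are made up for illustration.

```python
import os
import pickle
import tempfile

from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the first batch and a later batch of articles
first_batch = ["stocks rallied on monday", "the election results are in"]
later_batch = ["stocks fell after the election"]

tfidf = TfidfVectorizer()
matrix_1 = tfidf.fit_transform(first_batch)

# Store the fitted vectorizer together with the matrix (hypothetical file name)
path = os.path.join(tempfile.gettempdir(), "tfidf_state.pkl")
with open(path, "wb") as f:
    pickle.dump({"tfidf": tfidf, "matrix": matrix_1}, f)

# ... later, possibly in another process ...
with open(path, "rb") as f:
    state = pickle.load(f)

# Transform only the new articles and append them to the stored rows
new_rows = state["tfidf"].transform(later_batch)
matrix_1 = vstack([state["matrix"], new_rows]).tocsr()
print(matrix_1.shape)
```

Only the new articles are vectorized here; the old rows are loaded as they were. Only unpickle files you wrote yourself, since pickle can execute arbitrary code on load.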

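However, I feel that calling tfidf.fit_transform(df['text']) again and again on different inputs will not give you any meaningful results, or maybe I have misunderstood. Cheers!!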
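To illustrate that concern, here is a small sketch with made-up sentences. It compares transform() on a fitted vectorizer with a fresh fit_transform() on the new batch only; the variable names are just for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for an old batch and a new batch of articles
old_batch = ["stocks rallied on monday", "the election results are in"]
new_batch = ["stocks fell after the election"]

tfidf = TfidfVectorizer()
old_matrix = tfidf.fit_transform(old_batch)
old_columns = old_matrix.shape[1]

# transform() keeps the fitted vocabulary, so the column count is unchanged;
# words that were not seen during fitting are ignored
kept = tfidf.transform(new_batch)

# fit_transform() on the new batch alone learns a different vocabulary,
# so its columns cannot be stacked under the old matrix
refit = TfidfVectorizer().fit_transform(new_batch)

print(old_columns, kept.shape[1], refit.shape[1])
```

The trade-off is that transform() also keeps the IDF weights from the first fit, so words that appear only in later articles get no column. If the vocabulary drifts a lot, an occasional full refit over all articles may still be needed.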
