python - Calculating Tf-Idf scores with pandas?

Question

I want to calculate tf and idf separately from the documents below. I am using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

I want to compute this with the Tf-Idf formula directly, without using the sklearn library.

After tokenization, I used this for the TF calculation:

tf = df.sent.apply(pd.value_counts).fillna(0)

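This gives me raw counts, but I want the ratio (count / total number of words).

For IDF: df[df['sent'] > 0] / (1 + len(df['sent'])

But it does not seem to work. I want both Tf and Idf as pandas Series.

Edit

For tokenization I used df['sent'] = df['sent'].apply(word_tokenize), and I obtained the idf scores with: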

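from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
feature_array = tfidf.fit_transform(df['sent'])
# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
d = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))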

How do I get the tf scores separately?

score 3 · Accepted Answer

You need to do a bit more work to compute this.

import numpy as np
import pandas as pd

df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence', 
                        'This is the second sentence',
                        'This is the third sentence']})

# Tokenize and generate count vectors
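# (Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas)
word_vec = df.sent.str.split().apply(lambda tokens: pd.Series(tokens).value_counts()).fillna(0)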

# Compute term frequencies
tf = word_vec.divide(np.sum(word_vec, axis=1), axis=0)

# Compute inverse document frequencies
idf = np.log10(len(tf) / word_vec[word_vec > 0].count()) 

# Compute TF-IDF vectors: scale each column of tf by that term's idf
# (the idf Series index is aligned with the tf column labels)
tfidf = tf.multiply(idf, axis=1)

print(tfidf)

    is  the     first  This  sentence    second     third
0  0.0  0.0  0.095424   0.0       0.0  0.000000  0.000000
1  0.0  0.0  0.000000   0.0       0.0  0.095424  0.000000
2  0.0  0.0  0.000000   0.0       0.0  0.000000  0.095424
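To see where a value like 0.095424 comes from, here is a minimal hand check in plain Python. It assumes the same definitions as the code above (tf = count / document length, idf = log10(N / document frequency)); the helper names `term_freq` and `inv_doc_freq` are made up for this illustration.

```python
import math

docs = ["This is the first sentence",
        "This is the second sentence",
        "This is the third sentence"]
tokenized = [d.split() for d in docs]

def term_freq(term, tokens):
    # Share of the document's tokens that equal `term`
    return tokens.count(term) / len(tokens)

def inv_doc_freq(term, corpus):
    # log10(number of documents / number of documents containing `term`)
    doc_freq = sum(term in tokens for tokens in corpus)
    return math.log10(len(corpus) / doc_freq)

# "first" occurs once among 5 tokens and in 1 of 3 documents
score_first = term_freq("first", tokenized[0]) * inv_doc_freq("first", tokenized)
# "This" occurs in every document, so its idf is log10(3 / 3) = 0
score_this = term_freq("This", tokenized[0]) * inv_doc_freq("This", tokenized)

print(round(score_first, 6))  # 0.095424
print(score_this)             # 0.0
```

Words shared by all three sentences get an idf of zero under this definition, which is why only the distinguishing word in each row has a nonzero score.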

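Depending on your situation, you may want to normalize:

# L2 (Euclidean) norm of each row: square root of the sum of squares
l2_norm = np.sqrt(np.sum(tfidf ** 2, axis=1))

# Normalized TF-IDF vectors: divide each row by its norm
tfidf_norm = tfidf.div(l2_norm, axis=0)

print(tfidf_norm)

    is  the  first  This  sentence  second  third
0  0.0  0.0    1.0   0.0       0.0     0.0    0.0
1  0.0  0.0    0.0   0.0       0.0     1.0    0.0
2  0.0  0.0    0.0   0.0       0.0     0.0    1.0

In this toy corpus each document has a single nonzero score, so every row normalizes to 1.0 in that position.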
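As a cross-check on any row normalization, the Euclidean norm of a row is the square root of the sum of its squared entries, which is what np.linalg.norm computes. The sketch below uses a small made-up matrix (the values are hypothetical, chosen so the norms come out as round numbers) and confirms that every row has length 1 after dividing by its norm.

```python
import numpy as np
import pandas as pd

# Toy score matrix (hypothetical values), one row per document
scores = pd.DataFrame({"first": [3.0, 0.0], "second": [4.0, 2.0]})

# Row-wise Euclidean norm: sqrt(3^2 + 4^2) = 5 and sqrt(0^2 + 2^2) = 2
norms = np.linalg.norm(scores.to_numpy(), axis=1)

# Divide each row by its own norm
unit = scores.div(norms, axis=0)

# Every row of `unit` should now have Euclidean length 1
row_lengths = np.linalg.norm(unit.to_numpy(), axis=1)
print(row_lengths)
```

Rows that are entirely zero would produce a division by zero here, so a real corpus may need a guard for empty documents.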

score 1 · Accepted Answer

Here is my solution:

First tokenize, storing the tokens in a separate column for convenience:

df['tokens'] = [x.lower().split() for x in df.sent.values]

Then compute TF as you did, but with the normalize parameter (for technical reasons you need a lambda function):

tf = df.tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)
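The lambda is needed because each cell of the tokens column is a plain Python list, which has no value_counts method, so it has to be wrapped in a pd.Series first. A minimal sketch of what normalize=True does for a single document (the example sentence is made up):

```python
import pandas as pd

# One tokenized document; "the" appears twice out of five tokens
tokens = "the cat saw the dog".split()

# normalize=True returns relative frequencies (count / total) instead of raw counts
rel_freq = pd.Series(tokens).value_counts(normalize=True)

print(rel_freq)
```

The relative frequencies of one document always sum to 1, which is the (count / total number of words) ratio the question asks for.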

Then the IDF (one value per word in the vocabulary):

idf = pd.Series([np.log10(float(df.shape[0])/len([x for x in df.tokens.values if token in x])) for token in tf.columns])
idf.index = tf.columns
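The same IDF can also be obtained without the nested list comprehension, by counting how many rows of the tf table are nonzero in each column. This is only a sketch of an equivalent formulation, rebuilt from scratch here so it runs on its own; the name `doc_freq` is introduced for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# Lowercase, split on whitespace, then build relative term frequencies per document
tokens = df['sent'].str.lower().str.split()
tf = tokens.apply(lambda x: pd.Series(x).value_counts(normalize=True)).fillna(0)

# Document frequency: in how many documents does each term have a nonzero tf?
doc_freq = (tf > 0).sum(axis=0)

# idf is a Series indexed by the vocabulary (the columns of tf)
idf = np.log10(len(tf) / doc_freq)

print(idf)
```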

Then, if you want TFIDF:

tfidf = tf.copy()
for col in tfidf.columns:
    tfidf[col] = tfidf[col]*idf[col]
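If the column loop feels verbose, pandas can do the same multiplication in one call, because a Series multiplied along axis=1 is matched to the column labels. A small sketch with made-up numbers comparing the two approaches:

```python
import pandas as pd

# Toy inputs (hypothetical values): tf has one row per document, idf one entry per term
tf = pd.DataFrame({"a": [0.5, 0.0], "b": [0.5, 1.0]})
idf = pd.Series({"a": 0.3, "b": 0.0})

# Column-by-column loop, as in the answer
looped = tf.copy()
for col in looped.columns:
    looped[col] = looped[col] * idf[col]

# One-step version: the index of idf is aligned with the columns of tf
vectorized = tf.mul(idf, axis=1)

print(vectorized)
```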

score 0 · Accepted Answer

I think I had the same problem as you.

I wanted to use TfidfVectorizer, but its default tf-idf definition is not the standard one (tf-idf = tf + tf*idf rather than the usual tf-idf = tf*idf).
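For context, the scikit-learn documentation gives the default (smooth_idf=True) weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, and the trailing "+ 1" is what produces the "tf + tf*idf" shape. The sketch below only reproduces that arithmetic with the standard library; the numbers are example values, and the L2 row normalization the vectorizer applies by default is left out.

```python
import math

n_docs = 3      # documents in the corpus (example value)
doc_freq = 1    # documents containing the term (example value)
tf = 0.2        # term frequency within one document (example value)

# Textbook idf with natural log, for comparison
plain_idf = math.log(n_docs / doc_freq)

# Smoothed idf as documented for scikit-learn's default settings
smooth_idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Because of the "+ 1", the product splits into tf plus tf times a log term
weighted = tf * smooth_idf
split = tf + tf * math.log((1 + n_docs) / (1 + doc_freq))

print(plain_idf, smooth_idf)
```

One consequence is that a term present in every document still gets weight tf rather than zero.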

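TF: the term "frequency" is generally used to mean a count. For that you can use CountVectorizer() from sklearn. Apply a log transform and normalization afterwards if you need them.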

In my case, the numpy-based option took much longer to run (more than 50 times slower).
