java - tf idf similarity

Question

I am using TF/IDF to calculate similarity. For example, suppose I have the following two documents.

Doc A => cat dog
Doc B => dog sparrow

It seems reasonable that their similarity is 50%, but when I calculate their TF/IDF I get the following.

Tf values for Doc A

dog tf = 0.5
cat tf = 0.5

Tf values for Doc B

dog tf = 0.5
sparrow tf = 0.5

IDF values for Doc A

dog idf = -0.4055
cat idf = 0

IDF values for Doc B

dog idf = -0.4055 ( without +1 formula 0.6931)
sparrow idf = 0

TF/IDF value for Doc A

0.5 x -0.4055 + 0.5 x 0 = -0.20275

TF/IDF value for Doc B

0.5 x -0.4055 + 0.5 x 0 = -0.20275

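Now it looks like the similarity is -0.20275. Is that right? Or am I missing something? Or is there a next step? Please tell me so that I can calculate it as well.

I used the tf/idf formula given on Wikipedia.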

score 17 · Accepted Answer

Let's see if I understand your question: you want to compute the TF/IDF similarity between two documents:

Doc A: cat dog

and

Doc B: dog sparrow

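I assume this is your entire corpus. Therefore |D| = 2, and the Tfs are indeed 0.5 for all words. The IDF of 'dog' is log(|D| / |{d : dog in d}|) = log(2/2) = 0. Similarly, the IDFs of 'cat' and 'sparrow' are log(2/1) = log(2) = 1 (I use 2 as the logarithm base to make this easier).

Therefore the TF/IDF value for 'dog' is 0.5*0 = 0, and the TF/IDF values for 'cat' and 'sparrow' are 0.5*1 = 0.5.

To measure the similarity between the two documents, you should compute the cosine between their vectors in (cat, sparrow, dog) space: (0.5, 0, 0) and (0, 0.5, 0). The result is 0.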
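
The calculation above can be checked with a short script. This is only a minimal sketch: it assumes the two documents are the whole corpus, that tf is the term count divided by the document length, and that idf is log2(|D| / df) with no +1 smoothing. The helper names (tf, idf, tfidf, cosine) are made up for illustration.

```python
from collections import Counter
from math import log2, sqrt

# The whole corpus, as assumed in the answer.
docs = {"A": "cat dog".split(), "B": "dog sparrow".split()}

def tf(tokens):
    # Term frequency: count divided by document length.
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

def idf(term, corpus):
    # Inverse document frequency with base-2 logarithm, no smoothing.
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return log2(len(corpus) / df)

def tfidf(tokens, corpus):
    return {t: w * idf(t, corpus) for t, w in tf(tokens).items()}

def cosine(v1, v2):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vec_a = tfidf(docs["A"], docs)
vec_b = tfidf(docs["B"], docs)
print(vec_a)                 # {'cat': 0.5, 'dog': 0.0}
print(vec_b)                 # {'dog': 0.0, 'sparrow': 0.5}
print(cosine(vec_a, vec_b))  # 0.0
```

The only shared term, 'dog', appears in every document, so its weight is 0 and it contributes nothing to the dot product.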

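To sum up:

You have a mistake in your IDF calculation.
This mistake produces wrong TF/IDF values.
The Wikipedia article does not explain the use of TF/IDF for similarity very well. I prefer the explanation by Manning, Raghavan and Schütze.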

score 0 · Accepted Answer

I think you have to use ln instead of log.

answered 2010-01-03T16:10:39.600
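
A quick check may be useful when choosing the base. Changing the logarithm base multiplies every IDF by the same constant, so it rescales the TF/IDF weights but should leave the cosine between two documents unchanged. The sketch below assumes a made-up three-document corpus and hypothetical helper names; it is an illustration, not part of the original answer.

```python
from math import log, log10, sqrt

# Made-up corpus, used only for this illustration.
corpus = [["cat", "dog"], ["dog", "sparrow"], ["cat", "cat", "fish"]]

def tfidf(tokens, logfn):
    # tf = count / length, idf = logfn(|D| / df); logfn selects the base.
    vec = {}
    for term in set(tokens):
        df = sum(1 for d in corpus if term in d)
        vec[term] = tokens.count(term) / len(tokens) * logfn(len(corpus) / df)
    return vec

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sim_ln = cosine(tfidf(corpus[0], log), tfidf(corpus[2], log))
sim_log10 = cosine(tfidf(corpus[0], log10), tfidf(corpus[2], log10))
print(sim_ln, sim_log10)  # the two values agree up to floating-point rounding
```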

score 0 · Accepted Answer

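from math import log10, sqrt

# getidf(token) is not defined in this snippet; it is assumed to return
# the inverse document frequency of the token for your corpus.

def calctfidfvec(tfvec, withidf):
    # tfvec maps token -> term frequency (must be > 0 for log10).
    tfidfvec = {}
    veclen = 0.0

    for token in tfvec:
        # Log-scaled term frequency, optionally weighted by idf.
        if withidf:
            tfidf = (1 + log10(tfvec[token])) * getidf(token)
        else:
            tfidf = 1 + log10(tfvec[token])
        tfidfvec[token] = tfidf
        veclen += pow(tfidf, 2)

    # Normalize to unit length so the dot product equals the cosine.
    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)

    return tfidfvec

def cosinesim(vec1, vec2):
    # Both vectors are assumed to be unit-length outputs of calctfidfvec.
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token] * vec2[token]

    return sim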
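
The snippet above calls getidf, which is not shown in the answer. One possible way to provide it is sketched below, assuming the corpus is a small in-memory list of tokenized documents and that idf is log10(N / df) without smoothing. The factory name make_getidf and the handling of unseen tokens are assumptions made for illustration.

```python
from math import log10

def make_getidf(corpus):
    """Return a getidf(token) function for a list of tokenized documents."""
    ndocs = len(corpus)
    docfreq = {}
    for doc in corpus:
        # Count each token once per document.
        for token in set(doc):
            docfreq[token] = docfreq.get(token, 0) + 1

    def getidf(token):
        # Unseen tokens get weight 0 here; this is a design assumption.
        if token not in docfreq:
            return 0.0
        return log10(ndocs / docfreq[token])

    return getidf

getidf = make_getidf([["cat", "dog"], ["dog", "sparrow"]])
print(getidf("dog"))  # 0.0, because 'dog' occurs in every document
print(getidf("cat"))  # log10(2), about 0.301
```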
