
I have a corpus of classified text. From it I create vectors; each vector corresponds to one document. The vector components are word weights for that document, computed as TF-IDF values. Next I build a model in which every class is represented by a single vector, so the model has as many vectors as there are classes in the corpus. Each component of a model vector is computed as the mean of the corresponding component values across all vectors in that class. For an unclassified vector, I determine its similarity to a model vector by computing the cosine between the two vectors.
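For concreteness, the pipeline described above can be sketched with NumPy as follows. The numbers are toy placeholders, not real TF-IDF weights, and the class-centroid model is a minimal sketch of the setup in the question:

```python
import numpy as np

# Toy "TF-IDF" matrix: 4 documents x 5 terms (placeholder values),
# with two classes of two documents each.
docs = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.2],   # class 0
    [0.8, 0.2, 0.1, 0.0, 0.0],   # class 0
    [0.0, 0.1, 0.7, 0.9, 0.0],   # class 1
    [0.1, 0.0, 0.8, 0.7, 0.1],   # class 1
])
labels = np.array([0, 0, 1, 1])

# Model: one vector per class, each component the mean of that
# component over the class's document vectors.
centroids = np.array([docs[labels == c].mean(axis=0) for c in (0, 1)])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Classify an unseen vector by the most similar model vector.
query = np.array([0.7, 0.1, 0.1, 0.0, 0.1])
sims = [cosine(query, c) for c in centroids]
print(int(np.argmax(sims)))  # -> 0
```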

Questions:

1) Can I use the Euclidean distance between an unclassified vector and a model vector to compute their similarity?

2) Why can Euclidean distance not be used as a similarity measure instead of the cosine of the angle between two vectors, and vice versa?

Thanks!


4 Answers


An informal but fairly intuitive way to think about this is to consider the two components of a vector: its direction and its magnitude.

The direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while the magnitude is how strongly it points in that direction.

When classifying documents, we would like to categorize them by their overall sentiment, so we use the angular distance.

Euclidean distance is susceptible to documents being clustered by their L2 norm (their magnitude, in the 2-dimensional case) instead of by their direction; that is, vectors with quite different directions would be clustered together simply because their distances from the origin are similar.
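As a concrete illustration of that point (toy 2-D vectors, chosen purely for contrast):

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction, different magnitude: think of a short and a long
# document about the same topic.
a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])
print(cosine(a, b))           # ~1.0: identical direction
print(np.linalg.norm(a - b))  # ~20.1: yet far apart in Euclidean terms

# Different directions, similar distance from the origin.
c = np.array([3.0, 0.0])
d = np.array([0.0, 3.0])
print(cosine(c, d))           # 0.0: orthogonal, nothing in common
print(np.linalg.norm(c - d))  # ~4.24: much closer than a and b
```

So Euclidean distance would call the orthogonal pair more similar than the same-topic pair, which is usually the opposite of what we want for documents.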

Answered 2014-01-27T12:57:53.427

I'll answer the questions in reverse order. For your second question, cosine similarity and Euclidean distance are two different ways to measure vector similarity. The former measures the similarity of vectors with respect to the origin, while the latter measures the distance between particular points of interest along the vectors. You can use either in isolation, combine them and use both, or look at one of many other ways to determine similarity. See these slides from a Michael Collins lecture for more info.

Your first question isn't very clear, but you should be able to use either measure to find a distance between two vectors, regardless of whether you're comparing documents or your "models" (which would more traditionally be described as centroids of clusters, where the model is the collection of all clusters).
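One fact worth knowing when choosing between the two: on L2-normalized vectors they induce the same ranking, because ||a − b||² = 2(1 − cos(a, b)) when a and b have unit length. A quick numerical check (random vectors, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(100), rng.random(100)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit length

cos = np.dot(a, b)
dist_sq = np.linalg.norm(a - b) ** 2

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos for unit vectors
print(np.isclose(dist_sq, 2 * (1 - cos)))  # -> True
```

So if you normalize your TF-IDF vectors, picking the class with the highest cosine similarity and picking the class with the smallest Euclidean distance give the same answer.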

Answered 2014-01-03T18:16:33.980

Computation-time-wise (in Python):

import time
import numpy as np

# Time 10,000 similarity computations on random 100-d vectors,
# repeated over 10 trials.
for trial in range(10):
    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print('Cosine similarity took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        # Squared Euclidean distance recovered from cosine similarity;
        # note this identity only holds for unit-length vectors.
        2 * (1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print('Euclidean from 2*(1 - cosine_similarity) took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.linalg.norm(a - b)
    print('Euclidean Distance using np.linalg.norm() took', time.time() - start)

    start = time.time()
    for _ in range(10000):
        a, b = np.random.rand(100), np.random.rand(100)
        np.sqrt(np.sum((a - b) ** 2))
    print('Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took', time.time() - start)
    print('--------------------------------------------------------')

Output:

Cosine similarity took 0.15826010704
Euclidean from 2*(1 - cosine_similarity) took 0.179041862488
Euclidean Distance using np.linalg.norm() took 0.10684299469
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.113723039627
--------------------------------------------------------
Cosine similarity took 0.161732912064
Euclidean from 2*(1 - cosine_similarity) took 0.178358793259
Euclidean Distance using np.linalg.norm() took 0.107393980026
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111194849014
--------------------------------------------------------
Cosine similarity took 0.16274189949
Euclidean from 2*(1 - cosine_similarity) took 0.178978919983
Euclidean Distance using np.linalg.norm() took 0.106336116791
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111373186111
--------------------------------------------------------
Cosine similarity took 0.161939144135
Euclidean from 2*(1 - cosine_similarity) took 0.177414178848
Euclidean Distance using np.linalg.norm() took 0.106301784515
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11181807518
--------------------------------------------------------
Cosine similarity took 0.162333965302
Euclidean from 2*(1 - cosine_similarity) took 0.177582979202
Euclidean Distance using np.linalg.norm() took 0.105742931366
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111120939255
--------------------------------------------------------
Cosine similarity took 0.16153883934
Euclidean from 2*(1 - cosine_similarity) took 0.176836967468
Euclidean Distance using np.linalg.norm() took 0.106392860413
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110891103745
--------------------------------------------------------
Cosine similarity took 0.16018986702
Euclidean from 2*(1 - cosine_similarity) took 0.177738189697
Euclidean Distance using np.linalg.norm() took 0.105060100555
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110497951508
--------------------------------------------------------
Cosine similarity took 0.159607887268
Euclidean from 2*(1 - cosine_similarity) took 0.178565979004
Euclidean Distance using np.linalg.norm() took 0.106383085251
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.11084485054
--------------------------------------------------------
Cosine similarity took 0.161075115204
Euclidean from 2*(1 - cosine_similarity) took 0.177822828293
Euclidean Distance using np.linalg.norm() took 0.106630086899
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.110257148743
--------------------------------------------------------
Cosine similarity took 0.161051988602
Euclidean from 2*(1 - cosine_similarity) took 0.181928873062
Euclidean Distance using np.linalg.norm() took 0.106360197067
Euclidean Distance using np.sqrt(np.sum((a-b)**2)) took 0.111301898956
--------------------------------------------------------
Answered 2016-04-05T09:49:01.017

I would suggest that the only reliable way to determine which distance measure is better in a given application is to try both and see which one gives you more satisfactory results. My guess is that in most cases the difference in effectiveness will not be large, but that may not hold in your particular application.
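A minimal harness for that kind of comparison might look like the following. The data is synthetic (two noisy 2-D clusters standing in for TF-IDF vectors) and the nearest-centroid setup mirrors the model described in the question; neither measure is claimed to win in general:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "documents": two noisy clusters, 50 points each.
X0 = rng.normal(loc=[1.0, 0.2], scale=0.1, size=(50, 2))
X1 = rng.normal(loc=[0.2, 1.0], scale=0.1, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# One centroid per class, as in the question's model.
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_cosine(v):
    sims = centroids @ v / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(v))
    return int(np.argmax(sims))

def predict_euclidean(v):
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))

acc_cos = np.mean([predict_cosine(v) == c for v, c in zip(X, y)])
acc_euc = np.mean([predict_euclidean(v) == c for v, c in zip(X, y)])
print(acc_cos, acc_euc)  # compare and keep whichever works better on your data
```

Swap in your real TF-IDF matrix and a held-out test split to get a meaningful comparison.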

Answered 2019-01-03T06:13:37.380