
I am working on clustering social network profiles, where each profile document is represented by the number of times a term of interest occurs in the profile description. To cluster effectively, I am trying to find the right similarity measure (or distance function) between two profiles.

So let's say I have the following table of profiles:

            basketball  cricket python
profile1        4           2     1
profile2        2           1     3
profile3        2           1     0

Now, calculating the Euclidean distance, I get:

distance (profile1,profile2) = 3
distance (profile2,profile3) = 3
distance (profile3,profile1) = 2.45
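
For reference, a minimal sketch of how these distances can be computed, assuming numpy:

import numpy as np

# rows: profile1, profile2, profile3; columns: basketball, cricket, python
profiles = np.array([[4, 2, 1],
                     [2, 1, 3],
                     [2, 1, 0]])

print(np.linalg.norm(profiles[0] - profiles[1]))   # 3.0
print(np.linalg.norm(profiles[1] - profiles[2]))   # 3.0
print(np.linalg.norm(profiles[2] - profiles[0]))   # ~2.449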

Now, this is fine, but two questions come to mind.

Here we are disregarding the number of features the profiles have in common. For example, even though profiles 1 and 3 come out nearest, human intuition says otherwise: profiles 1 and 2 have at least some value in all three interests (basketball, cricket, and python), so these two are likely more similar than profiles 1 and 3, where one of them (profile 3) does not mention python at all. At the same time, I don't want the distance to be just a count of shared features, which would surely yield wrong results.

My first question: is there an established way to accommodate this intuition?

My second question: some profile authors are more verbose than others. How do I adjust for this? A verbose author whose profile has 4 occurrences of python may be just as interested as a less verbose author with 2 occurrences of python.

I was not able to come up with a good title for the question, so sorry if it's confusing.


1 Answer


First, compute your profiles just as you already do. The key step then is some kind of normalization. You can divide the counts by their total so that they sum to 1, or you can divide them by their Euclidean norm so that they have Euclidean norm 1.

For example, using sum normalization, the first profile would become (rounded):

0.57, 0.29, 0.14

and using Euclidean normalization it would become:

0.87, 0.44, 0.22

This will ensure that all profiles are represented in the same numeric range, and it will take care of the "overly verbose profile author" problem.


Below is an example IPython session that shows how to normalize the rows by their row sums and how to compute the Euclidean distances between the normalized rows. You will see that after normalization, profiles 1 and 3 are much closer, as you would expect.

In [21]: from numpy import array, newaxis, sqrt; from numpy.linalg import norm

In [22]: p = array([[4,2,1],[2,1,3],[2,1,0]])

In [23]: p
Out[23]: 
array([[4, 2, 1],
       [2, 1, 3],
       [2, 1, 0]])

In [24]: p = p / p.sum(axis=1)[:,newaxis]   # normalize each row to sum to 1

In [25]: p
Out[25]: 
array([[ 0.57142857,  0.28571429,  0.14285714],
       [ 0.33333333,  0.16666667,  0.5       ],
       [ 0.66666667,  0.33333333,  0.        ]])

In [26]: p.sum(axis=1)
Out[26]: array([ 1.,  1.,  1.])

In [27]: norm(p[0] - p[1])   # distance 1-2
Out[27]: 0.44543540318737401

In [28]: norm(p[0] - p[2])   # distance 1-3
Out[28]: 0.17817416127494959

In [29]: norm(p[1] - p[2])   # distance 2-3
Out[29]: 0.62360956446232352
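
For comparison, here is a short sketch of the Euclidean-norm variant mentioned above; this is my own illustration in plain numpy, separate from the session (values rounded):

import numpy as np

q = np.array([[4, 2, 1], [2, 1, 3], [2, 1, 0]], dtype=float)
q = q / np.linalg.norm(q, axis=1)[:, np.newaxis]   # each row now has Euclidean norm 1

print(np.linalg.norm(q[0] - q[1]))   # ~0.695  distance 1-2
print(np.linalg.norm(q[0] - q[2]))   # ~0.220  distance 1-3
print(np.linalg.norm(q[1] - q[2]))   # ~0.897  distance 2-3

As with sum normalization, profiles 1 and 3 come out closest.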

Finally, if you want to give more weight to whether a profile mentions an interest at all, rather than to how often it mentions it, you can do one extra step before normalizing: simply compute pow(x, alpha) for each element x of your profile vectors, where alpha is a parameter between 0 and 1. Here, alpha = 1 means the same standard linear weighting as before, while making alpha approach 0 means that only mentioning an interest counts, not how often. For example, using alpha = 0.5 (taking the square root of the profiles), we get:

In [32]: p = array([[4,2,1],[2,1,3],[2,1,0]])

In [33]: p = sqrt(p)   # alpha = 0.5: element-wise square root

In [34]: p
Out[34]: 
array([[ 2.        ,  1.41421356,  1.        ],
       [ 1.41421356,  1.        ,  1.73205081],
       [ 1.41421356,  1.        ,  0.        ]])

In [35]: p = p / p.sum(axis=1)[:,newaxis]

In [37]: norm(p[0] - p[1])   # distance 1-2
Out[37]: 0.2353133053319465

In [38]: norm(p[0] - p[2])   # distance 1-3
Out[38]: 0.27881275777438091

In [39]: norm(p[1] - p[2])   # distance 2-3
Out[39]: 0.51412606310632747

Now profiles 1 and 2 are the closest match, since we put more emphasis on the fact that they both mention Python at all rather than on how often they mention it.
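
To make alpha an explicit parameter instead of hard-coding the square root, the preprocessing step can be written roughly like this (a sketch, assuming numpy; the helper name weight_counts is my own):

import numpy as np

def weight_counts(p, alpha):
    """Raise each count to the power alpha (0 < alpha <= 1), then
    normalize each row to sum to 1. alpha = 1 keeps the linear
    weighting; alpha near 0 approaches "mentioned at all" weighting."""
    p = np.asarray(p, dtype=float) ** alpha
    return p / p.sum(axis=1)[:, np.newaxis]

p = weight_counts([[4, 2, 1], [2, 1, 3], [2, 1, 0]], alpha=0.5)
print(np.linalg.norm(p[0] - p[1]))   # ~0.235, matching In [37] above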

answered 2015-05-04T07:42:11.657