
Edit, to avoid confusion: the reason I want a value attached to each list is that I want to use the contents of each list as a feature value for clustering algorithms. I have 1000 items, each with a list of company names, and I want to transform the contents of each list into a value that becomes one of the features for that item. (That is also why I use a base list.) Thanks.

I'm trying to use Python to analyze some texts, and I now have 1000 lists, each containing company names. For example:

list1 = ['google', 'facebook', 'twitter', 'IBM']
list2 = ['microsoft', 'bloomberg', '1010Data']
list3 = ['google', 'microsoft', '1010Data']

I want to measure how similar these lists are. list1 and list2 have nothing in common, but list1 and list3, and list2 and list3, share some entries. How can I measure that?

Initially I thought about using one base list that contains all the words from these lists. Here the base list could be:

base_list = ['google', 'facebook', 'twitter', 'IBM', 'microsoft', 'bloomberg','1010Data'] 

and its vector value is:

base_vector = [1, 1, 1, 1, 1, 1, 1]

Then each list gets a vector that records, position by position, whether the corresponding word of base_list appears in it. (Here base_list, list1, list2 and list3 all use the same word order.)

list1 = [1, 1, 1, 1, 0, 0, 0]
list2 = [0, 0, 0, 0, 1, 1, 1]
list3 = [1, 0, 0, 0, 1, 0, 1]
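The mapping from name lists to 0/1 vectors described above could be sketched roughly like this. The helper name to_vector is hypothetical (it does not appear in the original post), and the sketch assumes every name is spelled exactly as in base_list.

```python
# Hypothetical helper: mark which words of base_list occur in a given list.
def to_vector(names, base_list):
    present = set(names)  # set lookup keeps membership tests cheap
    return [1 if name in present else 0 for name in base_list]

base_list = ['google', 'facebook', 'twitter', 'IBM',
             'microsoft', 'bloomberg', '1010Data']
list1 = ['google', 'facebook', 'twitter', 'IBM']
list2 = ['microsoft', 'bloomberg', '1010Data']
list3 = ['google', 'microsoft', '1010Data']

vec1 = to_vector(list1, base_list)
vec2 = to_vector(list2, base_list)
vec3 = to_vector(list3, base_list)
print(vec1, vec2, vec3)
```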

I want to measure their differences (or similarities) by comparing each of them with the base_vector, to get the angle value.

But there is a big issue. Suppose:

list1 = [1, 1, 1, 0, 0, 0]
list2 = [0, 0, 0, 1, 1, 1]

Then their angles with the base vector are the same, even though the two lists share no words!
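To make the issue concrete, here is a small sketch using only the standard library. The function name cosine and the variables vec_a and vec_b are made up for illustration. Against an all-ones base vector the cosine depends only on how many ones a vector has, so the two vectors look identical, while comparing them directly with each other gives 0.

```python
import math

def cosine(u, v):
    # cosine of the angle between two equal-length numeric vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

base_vector = [1, 1, 1, 1, 1, 1]
vec_a = [1, 1, 1, 0, 0, 0]
vec_b = [0, 0, 0, 1, 1, 1]

# Both vectors have three ones, so they make the same angle with base_vector.
same_vs_base = math.isclose(cosine(vec_a, base_vector),
                            cosine(vec_b, base_vector))
# Compared directly, they share no positions at all.
direct = cosine(vec_a, vec_b)
print(same_vs_base, direct)
```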

Any suggestions on how I can measure the similarity of the contents of these lists? I don't have to use this vector method; I'm just stuck.

Thanks!


2 Answers


You can use numpy to compute the cosine similarity between the lists:

>>> import numpy as np
>>> list2 = [0, 0, 0, 0, 1, 1, 1]
>>> list3 = [1, 0, 0, 0, 1, 0, 1]
>>> # this is the cosine of the angle between the two vectors, not the angle itself
>>> cos_sim = np.dot(list2, list3) / (np.linalg.norm(list2) * np.linalg.norm(list3))
>>> cos_sim
0.66666666666666674

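Alternatively, you can use scipy and its spatial distance functions (scipy.spatial.distance), such as Manhattan (cityblock), Euclidean, and Jaccard. These are distances rather than similarities, so smaller values mean more similar lists. Scipy also has a cosine distance (one minus the cosine similarity), which seems easier to use.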

answered 2014-11-04T14:54:05.473

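I thought of another solution, using Jaccard similarity. You don't have to convert the lists into numeric lists using a base list as a reference. Just apply the formula (size of the intersection divided by size of the union):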

>>> list1 = ['google', 'facebook', 'twitter', 'IBM']
>>> list2 = ['microsoft', 'bloomberg', '1010Data']
>>> list3 = ['google', 'microsoft', '1010Data']

>>> float(len(set(list2).intersection(list3)))/len(set(list2).union(list3))
0.5
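Since the question involves many lists, the same formula could be applied to every pair. The following is only a sketch: the function name jaccard, the dictionary lists and the choice to return 0.0 for two empty lists are assumptions, not part of the original answer. It assumes Python 3, where / is true division, so no float() call is needed.

```python
from itertools import combinations

def jaccard(a, b):
    # |A intersect B| / |A union B| on the sets of names
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty lists count as not similar
    return len(set_a & set_b) / len(union)

lists = {
    'list1': ['google', 'facebook', 'twitter', 'IBM'],
    'list2': ['microsoft', 'bloomberg', '1010Data'],
    'list3': ['google', 'microsoft', '1010Data'],
}

# Jaccard similarity for every unordered pair of lists
scores = {
    (n1, n2): jaccard(lists[n1], lists[n2])
    for n1, n2 in combinations(lists, 2)
}
for pair, score in scores.items():
    print(pair, score)
```

The resulting pairwise scores can be arranged into a similarity matrix (or one minus the score as a distance) if a clustering algorithm needs one.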
answered 2014-11-04T15:20:00.520