To avoid confusion: the reason I want a single value attached to each list is that I want to use the contents of each list as a feature value for a clustering algorithm. The original setup is that I have 1000 items, each with a list of company names, and I want to transform each list's contents into a value that becomes one of that item's features. (That is also why I use a base list.) Thanks.
I'm trying to use Python to analyze some text, and I now have 1000 lists, each containing company names. For example:
list1 = ['google', 'facebook', 'twitter', 'IBM']
list2 = ['microsoft', 'bloomberg', '1010Data']
list3 = ['google', 'microsoft', '1010Data']
I want to measure how similar these lists are. list1 and list2 have nothing in common, but list1 and list3, as well as list2 and list3, share some elements. How can I measure that?
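To make the overlap concrete, here is a minimal Python sketch that just lists the names each pair has in common. The helper name `shared` is only for illustration, and this is a way of stating the problem, not the measure I'm looking for.

```python
list1 = ['google', 'facebook', 'twitter', 'IBM']
list2 = ['microsoft', 'bloomberg', '1010Data']
list3 = ['google', 'microsoft', '1010Data']

def shared(a, b):
    """Return the set of names that appear in both lists (hypothetical helper)."""
    return set(a) & set(b)

print(len(shared(list1, list2)))     # 0
print(sorted(shared(list1, list3)))  # ['google']
print(sorted(shared(list2, list3)))  # ['1010Data', 'microsoft']
```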
Initially I thought about using one base vector which contains all the words from these lists. Here this base list could be:
base_list = ['google', 'facebook', 'twitter', 'IBM', 'microsoft', 'bloomberg', '1010Data']
and its vector value is:
base_vector = [1, 1, 1, 1, 1, 1, 1]
Each list is then converted to a vector according to which words appear in it and where those words sit in base_list. (Here base_list, list1, list2, and list3 are all sorted.)
list1 = [1, 1, 1, 1, 0, 0, 0]
list2 = [0, 0, 0, 0, 1, 1, 1]
list3 = [1, 0, 0, 0, 1, 0, 1]
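The conversion above can be sketched in Python roughly as follows. The function name `to_vector` is just something I made up for this example, and it assumes every name in a list also appears in base_list.

```python
base_list = ['google', 'facebook', 'twitter', 'IBM',
             'microsoft', 'bloomberg', '1010Data']

list1 = ['google', 'facebook', 'twitter', 'IBM']
list2 = ['microsoft', 'bloomberg', '1010Data']
list3 = ['google', 'microsoft', '1010Data']

def to_vector(names, base):
    """Mark each position of `base` with 1 if that name is in `names`, else 0."""
    present = set(names)
    return [1 if name in present else 0 for name in base]

vec1 = to_vector(list1, base_list)
vec2 = to_vector(list2, base_list)
vec3 = to_vector(list3, base_list)

print(vec1)  # [1, 1, 1, 1, 0, 0, 0]
print(vec2)  # [0, 0, 0, 0, 1, 1, 1]
print(vec3)  # [1, 0, 0, 0, 1, 0, 1]
```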
I want to measure their differences (or similarities) by comparing each of them with base_vector and taking the angle between the two vectors.
But there is a big problem. Suppose I have:
list1 = [1, 1, 1, 0, 0, 0]
list2 = [0, 0, 0, 1, 1, 1]
Then their angles with the base vector are the same, even though the two lists have nothing in common!
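Here is a small sketch that reproduces the issue, using the cosine of the angle in place of the angle itself. The `cosine` helper and the names `v1`, `v2` are my own for this example, and the six-element base vector is assumed to match the two vectors above.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

base_vector = [1, 1, 1, 1, 1, 1]
v1 = [1, 1, 1, 0, 0, 0]
v2 = [0, 0, 0, 1, 1, 1]

# Both vectors make the same angle with the base vector ...
print(cosine(v1, base_vector) == cosine(v2, base_vector))  # True

# ... even though they share no positions at all.
print(cosine(v1, v2))  # 0.0
```

So a single angle against the base vector only reflects how many names a list has, not which ones.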
Any suggestions on how I can measure the similarity of the contents of these lists? I don't have to use this vector method; it's just where I got stuck.
Thanks!