I am working on a clustering problem over social network profiles. Each profile document is represented by the number of times each 'term of interest' occurs in the profile description. To cluster effectively, I am trying to find the right similarity measure (or distance function) between two profiles.
So let's say I have the following table of profiles:
profile    basketball  cricket  python
profile1   4           2        1
profile2   2           1        3
profile3   2           1        0
Computing the Euclidean distance between each pair, I get:
distance (profile1,profile2) = 3
distance (profile2,profile3) = 3
distance (profile3,profile1) = 2.45
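For reference, the distances above can be reproduced with a minimal sketch like the following. The dictionary layout and the helper name `euclidean` are just my own choices for illustration, not part of any library.

```python
import math

# Term counts per profile, in the order (basketball, cricket, python).
profiles = {
    "profile1": (4, 2, 1),
    "profile2": (2, 1, 3),
    "profile3": (2, 1, 0),
}

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pairs = [
    ("profile1", "profile2"),
    ("profile2", "profile3"),
    ("profile3", "profile1"),
]
for p, q in pairs:
    print(f"distance ({p},{q}) = {euclidean(profiles[p], profiles[q]):.2f}")
```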
This is fine, but two questions come to mind.
This measure disregards how many features two profiles have in common. For example, profile 1 and profile 3 are the nearest pair, yet by human intuition profile 1 and profile 2 should be more similar: both have a non-zero count for all three interests (basketball, cricket and python), whereas profile 3 does not mention python at all. I also don't want to use just the count of shared features as the distance, since that would surely give wrong results.
My first question: is there an established way to accommodate this intuition?
My second question: some profile authors are more verbose than others. How can I adjust for this? A verbose author with 4 occurrences of python may be equivalent to a less verbose author with 2 occurrences of python.
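To make the verbosity concern concrete, here is a small sketch. The `terse` and `verbose` vectors are hypothetical examples I made up (the verbose one is the terse one with every count doubled), and `euclidean` is the same illustrative helper, not a library function. It only demonstrates the problem: two profiles with identical proportions of interests still end up a non-zero distance apart.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical authors with the same interests in the same proportions;
# the verbose one simply mentions every term twice as often.
terse = (2, 1, 0)
verbose = tuple(2 * count for count in terse)  # (4, 2, 0)

gap = euclidean(terse, verbose)
print(f"distance (terse,verbose) = {gap:.2f}")
```

Ideally the measure I am looking for would treat these two authors as (nearly) identical.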
I was not able to come up with a good title for the question, so sorry if it's confusing.