I have a pandas dataframe like this, where each ID is an observation with variables attr1, attr2 and attr3:
ID attr1 attr2 attr3
20 2 1 2
10 1 3 1
5 2 2 4
7 1 2 1
16 1 2 3
28 1 1 3
35 1 1 1
40 1 2 3
46 1 2 3
21 3 1 3
and made a similarity matrix I want to use where the IDs are compared based on the sum of the pairwise attribute differences.
[[ 0. 4. 3. 3. 3. 2. 2. 3. 3. 2.]
[ 4. 0. 5. 1. 3. 4. 2. 3. 3. 6.]
[ 3. 5. 0. 4. 2. 3. 5. 2. 2. 3.]
[ 3. 1. 4. 0. 2. 3. 1. 2. 2. 5.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 4. 3. 3. 1. 0. 2. 1. 1. 2.]
[ 2. 2. 5. 1. 3. 2. 0. 3. 3. 4.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 6. 3. 5. 3. 2. 4. 3. 3. 0.]]
I tried DBSCAN from sklearn for clustering the data, but it seems only the clusters themselves are labeled? I want to find the ID for the data points in the visualization later. So I only want to cluster the difference between the IDs, but not the IDs themselves. Is there another algorithm better for this kind of data, or a way I can label the distance matrix values so it can be used with the DBSCAN or another method? ps.the dataset has over 50 attributes and 10000 observations