I have a pandas DataFrame like this, where each ID is one observation with the variables attr1, attr2 and attr3:

    ID    attr1    attr2    attr3
    20        2        1        2
    10        1        3        1
     5        2        2        4
     7        1        2        1
    16        1        2        3
    28        1        1        3
    35        1        1        1
    40        1        2        3
    46        1        2        3
    21        3        1        3

From it I built the similarity matrix I want to use, in which each pair of IDs is compared by the sum of the absolute differences between their attributes (rows and columns are in the same order as the IDs above):

[[ 0.  4.  3.  3.  3.  2.  2.  3.  3.  2.]
 [ 4.  0.  5.  1.  3.  4.  2.  3.  3.  6.]
 [ 3.  5.  0.  4.  2.  3.  5.  2.  2.  3.]
 [ 3.  1.  4.  0.  2.  3.  1.  2.  2.  5.]
 [ 3.  3.  2.  2.  0.  1.  3.  0.  0.  3.]
 [ 2.  4.  3.  3.  1.  0.  2.  1.  1.  2.]
 [ 2.  2.  5.  1.  3.  2.  0.  3.  3.  4.]
 [ 3.  3.  2.  2.  0.  1.  3.  0.  0.  3.]
 [ 3.  3.  2.  2.  0.  1.  3.  0.  0.  3.]
 [ 2.  6.  3.  5.  3.  2.  4.  3.  3.  0.]]
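For reference, a matrix like the one above can be reproduced as the Manhattan (city-block) distance between rows. This is only a minimal sketch, assuming scikit-learn's `pairwise_distances` is acceptable for the job; the names `df`, `features`, `dist` and `dist_df` are made up for illustration. Wrapping the result in a DataFrame indexed by ID is one possible way to keep the IDs attached to the distance values.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# The example data from the question (only 3 of the 50+ attributes).
df = pd.DataFrame({
    "ID":    [20, 10, 5, 7, 16, 28, 35, 40, 46, 21],
    "attr1": [2, 1, 2, 1, 1, 1, 1, 1, 1, 3],
    "attr2": [1, 3, 2, 2, 2, 1, 1, 2, 2, 1],
    "attr3": [2, 1, 4, 1, 3, 3, 1, 3, 3, 3],
})

# Use only the attribute columns, so the ID values do not influence the distances.
features = df[["attr1", "attr2", "attr3"]].to_numpy()

# Manhattan distance = sum of absolute attribute differences for each pair of rows.
dist = pairwise_distances(features, metric="manhattan")

# Optional: label rows and columns with the IDs so a pair can be looked up by ID.
dist_df = pd.DataFrame(dist, index=df["ID"], columns=df["ID"])
print(dist_df.loc[20, 10])
```

Row and column `i` of `dist` always correspond to row `i` of `df`, so the ID of any entry can be recovered from its position even without the labeled `dist_df`.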

I tried DBSCAN from sklearn to cluster the data, but it seems that only the clusters themselves are labeled. I want to be able to find the ID of each data point in the visualization later. In other words, I want to cluster on the differences between the IDs, not on the ID values themselves. Is there another algorithm better suited to this kind of data, or a way to label the distance matrix values so that the matrix can be used with DBSCAN or another method?

P.S. The dataset has over 50 attributes and 10000 observations.

1 Answer

0

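The labels_ attribute gives you an array of labels, one for each data point in your training data. The first entry of that array is the label of your first training data point, the second entry is the label of the second, and so on.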
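Because the order of labels_ matches the order of the input rows, the labels can be joined back to the IDs by position. Here is a rough sketch, assuming the distance matrix is passed to DBSCAN with `metric="precomputed"`. The `eps=1.5` and `min_samples=2` values are arbitrary choices for this toy data, not recommendations, and the `cluster` column name is made up for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# The example data from the question.
df = pd.DataFrame({
    "ID":    [20, 10, 5, 7, 16, 28, 35, 40, 46, 21],
    "attr1": [2, 1, 2, 1, 1, 1, 1, 1, 1, 3],
    "attr2": [1, 3, 2, 2, 2, 1, 1, 2, 2, 1],
    "attr3": [2, 1, 4, 1, 3, 3, 1, 3, 3, 3],
})

# Distances are computed from the attributes only; the ID column is left out.
dist = pairwise_distances(df[["attr1", "attr2", "attr3"]].to_numpy(),
                          metric="manhattan")

# metric="precomputed" tells DBSCAN that `dist` is already a distance matrix.
# eps and min_samples are illustrative values (assumptions), tune them for real data.
db = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit(dist)

# labels_[i] belongs to row i of the input, so it lines up with df row i.
df["cluster"] = db.labels_

# Look up the cluster of each ID; a label of -1 means DBSCAN treated it as noise.
print(df[["ID", "cluster"]])
```

With the `cluster` column attached to the same DataFrame as the IDs, any point in a later visualization can be traced back to its ID by row position.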

Answered 2017-06-13T15:22:42.080