
I have a dataset with 14 features (columns) and 4328 rows. After preprocessing, I converted the values into a NumPy array of shape (4328, 14) and fitted Mean Shift on it, which separated the data points into 29 clusters.

Cluster centres:

array([[ 0.00000000e+00,  2.88896062e+02,  2.78953471e+02,
         2.08648004e+02,  2.12223611e+02,  5.38985939e+01,
         3.71283150e-01,  5.70311771e+03,  4.54253094e-01,
         1.30592925e+00,  6.64259488e+00,  3.82481843e+00,
         6.43865296e+00,  6.43865296e+00],
       [ 0.00000000e+00,  2.83183908e+02,  9.48864664e+01,
         3.59258621e+03,  9.05744253e+01,  8.35206117e+00,
         4.13793103e-01,  5.70172414e+03,  2.78249425e-01,
         8.88868966e-01,  6.63727816e+00,  4.84751149e+00,
         6.61705172e+00,  6.61705172e+00],
       [ 0.00000000e+00,  3.15511628e+02,  7.55761355e+01,
         6.52134884e+03,  7.04900000e+01,  6.69296631e+00,
         3.72093023e-01,  5.69984767e+03,  3.52367442e-01,
         9.50423256e-01,  6.81103721e+00,  2.70016977e+00,
         3.48411628e+00,  3.48411628e+00],
       [ 0.00000000e+00,  2.98297297e+02,  4.95190674e+01,
         9.43194595e+03,  4.64532432e+01,  4.89748830e+00,
         3.24324324e-01,  5.69470405e+03,  1.71972973e-01,
         1.21458649e+00,  6.85496486e+00,  3.54600000e+00,
         5.62750811e+00,  5.62750811e+00],
       [ 0.00000000e+00,  3.60428571e+02,  3.22145995e+03,
         9.85714286e+00,  3.24273036e+03, -6.35189676e-01,
         4.64285714e-01,  5.65968214e+03, -2.39050000e-01,
         7.49132143e-01,  6.57582857e+00, -2.07893214e+00,
        -6.82446429e-01, -6.82446429e-01],
       [ 0.00000000e+00,  2.48600000e+02,  4.35963021e+01,
         1.18772000e+04,  4.21820000e+01,  3.25541197e+00,
         4.00000000e-01,  5.69281500e+03, -4.94350000e-01,
        -1.41250000e-01,  7.01363000e+00, -7.76800000e-02,
         2.37982000e+00,  2.37982000e+00],
       [ 0.00000000e+00,  2.56777778e+02,  3.86608797e+01,
         1.48944444e+04,  3.43100000e+01,  1.36524043e+01,
         2.22222222e-01,  5.70588333e+03, -4.92000000e-02,
         8.88366667e-01,  6.78814444e+00,  5.58971111e+00,
         6.56455556e+00,  6.56455556e+00],
       [ 0.00000000e+00,  3.14111111e+02,  4.78123643e+01,
         2.02325556e+04,  4.67500000e+01,  4.74006148e+00,
         5.55555556e-01,  5.70420556e+03, -2.40100000e-01,
         8.96300000e-01,  7.09418889e+00,  6.68292222e+00,
         1.12132667e+01,  1.12132667e+01],
       [ 0.00000000e+00,  3.47200000e+02,  3.63744453e+01,
         5.02000000e+04,  3.45700000e+01,  4.97221480e+00,
         8.00000000e-01,  5.67206000e+03, -9.79280000e-01,
        -1.08820000e-01,  7.67404000e+00,  1.17406000e+00,
         1.44780600e+01,  1.44780600e+01],
       [ 0.00000000e+00,  5.46000000e+02,  1.04748000e+04,
         5.66666667e+00,  1.02684667e+04,  2.01687216e+00,
         3.33333333e-01,  5.72818333e+03,  5.43600000e-01,
         1.35213333e+00,  5.60560000e+00,  3.07716667e+00,
         2.22003333e+00,  2.22003333e+00],
       [ 0.00000000e+00,  2.09000000e+02,  2.39866667e+02,
         1.17000000e+02,  2.33150000e+02,  1.67530023e+00,
         1.00000000e+00,  9.13930000e+03, -1.69290000e+00,
        -7.47800000e-01,  2.30790000e+00,  7.06666667e-01,
         1.86860000e+00,  1.86860000e+00],
       [ 0.00000000e+00,  2.01666667e+02,  6.86686111e+01,
         2.57380000e+04,  6.56333333e+01,  5.85024181e+00,
         3.33333333e-01,  5.75526667e+03,  1.19680000e+00,
         2.18410000e+00,  6.13906667e+00,  1.75683667e+01,
         1.90339000e+01,  1.90339000e+01],
       [ 0.00000000e+00,  5.08000000e+02,  4.60818500e+04,
         4.00000000e+00,  4.42663500e+03,  9.41967667e+02,
         5.00000000e-01,  5.73742500e+03, -2.17150000e-01,
         1.11570000e+00,  6.81375000e+00,  2.84170000e+00,
         1.07105000e+00,  1.07105000e+00],
       [ 0.00000000e+00,  5.15000000e+02,  1.23800000e+03,
         2.00000000e+00,  3.66200000e+01,  3.28066630e+03,
         0.00000000e+00,  5.70330000e+03,  2.96260000e+00,
         2.53060000e+00,  6.56880000e+00,  2.56620000e+00,
         5.00280000e+00,  5.00280000e+00],
       [ 0.00000000e+00,  1.53000000e+02,  2.67980246e+01,
         2.50000000e+05,  2.46500000e+01,  8.71409574e+00,
         1.00000000e+00,  5.70805000e+03, -9.63100000e-01,
         4.70000000e-01,  6.79200000e+00, -5.11360000e+00,
         8.20730000e+00,  8.20730000e+00],
       [ 0.00000000e+00,  5.74000000e+02,  2.67405322e+01,
         4.10020000e+04,  2.49200000e+01,  7.30550630e+00,
         1.00000000e+00,  5.73125000e+03,  2.08130000e+00,
         3.34910000e+00,  6.92330000e+00,  5.08680000e+00,
         8.58970000e+00,  8.58970000e+00],
       [ 0.00000000e+00,  5.22000000e+02,  1.00364364e+02,
         3.75630000e+04,  4.90300000e+01,  1.04699906e+02,
         1.00000000e+00,  5.71880000e+03,  7.04600000e-01,
         2.16130000e+00,  5.72310000e+00, -3.00900000e-01,
         1.32520000e+00,  1.32520000e+00],
       [ 0.00000000e+00,  3.46000000e+02,  2.24756530e+02,
         1.27403000e+05,  2.22800000e+02,  8.78155326e-01,
         1.00000000e+00,  5.70805000e+03, -9.63100000e-01,
         4.70000000e-01,  6.79200000e+00,  2.50200000e-01,
         5.96300000e+00,  5.96300000e+00],
       [ 0.00000000e+00,  3.09000000e+02,  4.50972829e+01,
         3.50000000e+04,  4.33000000e+01,  4.15076872e+00,
         0.00000000e+00,  5.67600000e+03,  9.75300000e-01,
         6.17300000e-01,  6.62310000e+00,  4.01550000e+01,
         4.19152000e+01,  4.19152000e+01],
       [ 0.00000000e+00,  3.46000000e+02,  2.26916384e+02,
         1.00000000e+05,  2.24950000e+02,  8.74142476e-01,
         1.00000000e+00,  5.65215000e+03, -1.88000000e-01,
         7.87500000e-01,  7.94750000e+00, -3.13200000e-01,
         6.47550000e+00,  6.47550000e+00],
       [ 0.00000000e+00,  3.46000000e+02,  2.20191000e+02,
         2.75000000e+05,  2.31950000e+02, -5.06962715e+00,
         1.00000000e+00,  5.70460000e+03, -8.96800000e-01,
        -3.83300000e-01,  5.95260000e+00,  5.14140000e+00,
         7.58010000e+00,  7.58010000e+00],
       [ 0.00000000e+00,  2.18000000e+02,  1.69836215e+02,
         6.00000000e+04,  1.73550000e+02, -2.13989340e+00,
         1.00000000e+00,  5.74695000e+03,  2.21600000e-01,
        -2.66200000e-01,  5.37060000e+00,  4.42260000e+00,
         1.03538000e+01,  1.03538000e+01],
       [ 0.00000000e+00,  9.10000000e+01,  5.03828125e+01,
         3.20000000e+04,  4.85000000e+01,  3.88208763e+00,
         0.00000000e+00,  5.71880000e+03,  7.04600000e-01,
         2.16130000e+00,  5.72310000e+00,  7.97870000e+00,
         1.43018000e+01,  1.43018000e+01],
       [ 0.00000000e+00,  1.82000000e+02,  3.66395435e+01,
         5.40000000e+04,  3.63500000e+01,  7.96543380e-01,
         1.00000000e+00,  5.67605000e+03, -1.73390000e+00,
        -2.81400000e-01,  8.15350000e+00, -2.00800000e+00,
         1.52570000e+00,  1.52570000e+00],
       [ 0.00000000e+00,  3.43000000e+02,  2.31617647e+01,
         1.70000000e+04,  2.16500000e+01,  6.98274691e+00,
         0.00000000e+00,  5.67600000e+03,  9.75300000e-01,
         6.17300000e-01,  6.62310000e+00,  2.45333000e+01,
         2.12987000e+01,  2.12987000e+01],
       [ 0.00000000e+00,  2.18000000e+02,  1.63871636e+02,
         1.19500000e+05,  1.61950000e+02,  1.18656127e+00,
         1.00000000e+00,  5.64800000e+03, -2.77500000e-01,
        -1.23880000e+00,  7.32370000e+00, -6.76500000e-01,
        -7.47950000e+00, -7.47950000e+00],
       [ 0.00000000e+00,  3.46000000e+02,  2.24871313e+02,
         7.25970000e+04,  2.22800000e+02,  9.29673637e-01,
         1.00000000e+00,  5.70805000e+03, -9.63100000e-01,
         4.70000000e-01,  6.79200000e+00,  2.50200000e-01,
         5.96300000e+00,  5.96300000e+00],
       [ 0.00000000e+00,  5.70000000e+01,  1.02000000e+01,
         2.35008000e+05,  1.05000000e+01, -2.85714286e+00,
         1.00000000e+00,  5.70460000e+03, -8.96800000e-01,
        -3.83300000e-01,  5.95260000e+00, -3.77360000e+00,
         2.51260000e+00,  2.51260000e+00],
       [ 0.00000000e+00,  2.10000000e+01,  1.19055525e+01,
         4.15000000e+05,  1.14000000e+01,  4.43467132e+00,
         1.00000000e+00,  5.67605000e+03, -1.73390000e+00,
        -2.81400000e-01,  8.15350000e+00, -1.69065000e+01,
        -2.84830000e+01, -2.84830000e+01]])

I then tried plotting these clusters in a 2D plane, which produced this plot: [image]

I'm not sure why the clusters and their data points were all plotted on a single line, with an X-axis value of 0 for every point. Am I missing something? Should I be preprocessing my dataset differently if I want to separate it into distinct clusters?

Edit 1: Code used to plot the above graph (clf is the name of my model object):

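from itertools import cycle

import matplotlib.pyplot as plt
import numpy as np

# clf is the fitted MeanShift model; X is the (4328, 14) input array
labels = clf.labels_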
cluster_centers = clf.cluster_centers_
n_clusters_ = len(np.unique(labels))
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

1 Answer


Since your data has 14 features, MeanShift looks for "blobs" (clusters) in the 14-dimensional space, and it found 29 centers among your 4328 data points. The cluster_centers_ output therefore describes 29 points in a 14-dimensional space, hence its 29x14 shape, which is hard to visualize in a 2D plot.

When plotting, you are currently using only the first two dimensions of your data and cluster centers (plot(X[my_members, 0], X[my_members, 1], ...)), and since the first dimension appears to be all zeros, every plotted point has x = 0 and the points end up on a single vertical line.
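To confirm that a column really is constant before blaming the plot, something along these lines should work. This is only a sketch: the X below is a small synthetic stand-in for your (4328, 14) array, with column 0 zeroed out on purpose to mimic what your cluster centers suggest.

```python
import numpy as np

# Hypothetical stand-in for the real (4328, 14) array; column 0 is all zeros.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 14))
X[:, 0] = 0.0

# A column carries no information when its minimum equals its maximum.
constant_cols = np.flatnonzero(X.min(axis=0) == X.max(axis=0))
print(constant_cols)  # [0]
```

Running the same check on your own X would tell you whether column 0 is the only constant one.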

If you are only interested in the clustering result, you already have it in the clf.labels_ output, which should be a vector of length 4328 with one cluster label per row.
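As a rough illustration of using the labels directly, the sketch below fits MeanShift on two synthetic blobs and counts the points per cluster. The data and the bandwidth value are assumptions made up for the example, not something taken from your dataset.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical data: two well-separated blobs in 3 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 3)),
    rng.normal(10.0, 0.5, size=(50, 3)),
])

# bandwidth=2.0 is an arbitrary choice for this toy data.
clf = MeanShift(bandwidth=2.0).fit(X)
labels = clf.labels_          # one integer cluster label per row of X
print(labels.shape)           # (100,)

# Number of points assigned to each cluster, indexed by label.
sizes = np.bincount(labels)
print(sizes)
```

On your model, np.bincount(clf.labels_) would show how many of the 4328 rows fall into each of the 29 clusters, which is often more informative than a plot.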

To visualize higher-dimensional points, you could split the cluster data into several subplots (perhaps seven 2D plots) or reduce the number of dimensions in some way. You could start by removing the first column, since all of its values are the same (zero).
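A possible way to do the subplot idea is sketched below. It drops the first column and then plots consecutive column pairs, which gives six panels for the remaining 13 columns (the last column is left out). The X and labels here are random placeholders standing in for your array and clf.labels_, and the pairing of columns is just one arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the real data and for clf.labels_.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))
X[:, 0] = 0.0
labels = rng.integers(0, 3, size=200)

# Remove the constant first column; 13 columns remain.
X_reduced = X[:, 1:]

# Consecutive pairs (0, 1), (2, 3), ...; the odd column at the end is skipped.
pairs = [(i, i + 1) for i in range(0, X_reduced.shape[1] - 1, 2)]

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (a, b) in zip(axes.ravel(), pairs):
    ax.scatter(X_reduced[:, a], X_reduced[:, b], c=labels, cmap="tab10", s=5)
    # +1 maps back to the column index in the original 14-column array.
    ax.set_xlabel(f"column {a + 1}")
    ax.set_ylabel(f"column {b + 1}")
fig.tight_layout()
# plt.show()  # uncomment in an interactive session
```

Each panel only shows a 2D slice, so clusters that are separated along other columns can still overlap in any single panel.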

Another way to visualize higher-dimensional data in a 2D (or 3D) plot is t-SNE, which may be worth checking out. It is also available in scikit-learn, and there is a quick intro in this Google Talk.
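For reference, a minimal t-SNE sketch with scikit-learn could look like the following. The input is synthetic (two blobs in 13 dimensions, as if the constant column had already been dropped), and perplexity=30 is simply a common starting value, not a recommendation tuned to your data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical 13-dimensional data made of two separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(75, 13)),
    rng.normal(8.0, 1.0, size=(75, 13)),
])

# Project to 2 dimensions; perplexity must be smaller than the number of rows.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (150, 2)
```

The two columns of embedding can then take the place of X[my_members, 0] and X[my_members, 1] in your plotting loop. Keep in mind that the t-SNE axes have no direct meaning in terms of the original features, and the cluster centers would need to be embedded together with the data if you want to plot them too.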

Answered 2018-03-27T07:35:06.723