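I split my data into two parts and ran K-means clustering on the dataset after normalisation and PCA. Now I want to project the clusters back onto box plots, so that I can check which instances (rows) fall into which cluster and look for outliers.
Description: the data is loaded with pandas, normalised with MinMaxScaler from sklearn's preprocessing module, reduced with PCA, and then clustered.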
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, decomposition, cluster
# Load data from input file
X = pd.read_csv("/Users/blah blah.csv")
X.plot.scatter(x=6,y=7)
# Normalise the data
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)
X_norm=pd.DataFrame(np_scaled, columns=X.columns)
# PCA
pca = decomposition.PCA(n_components=5)
pca_model = pca.fit(X_norm)
print(pca.explained_variance_ratio_)
print(pca_model.components_)
pca_array = pca_model.transform(X_norm)
X_pca = pd.DataFrame(data=pca_array, columns=['PC1','PC2','PC3','PC4','PC5'])
X_pca.plot.scatter(x=1, y=3)
# K-Means Implementation
n_clusters=5
# algorithm='full' was renamed to 'lloyd' in scikit-learn 1.1 and removed in 1.3
kmeans = cluster.KMeans(n_clusters=n_clusters, init='random', n_init=1, algorithm='lloyd')
ac = kmeans.fit(X_pca)
print('\n..........Cluster centers............\n')
print(kmeans.cluster_centers_)
print('\n.........Cluster labels.........\n')
print(kmeans.labels_)
print('\n.............Scatter Plot K-Means......... \n')
X_pca.plot.scatter(x=1, y=3, c=kmeans.labels_, cmap='rainbow', title='K-Means Clustering')
plt.show()
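
One possible way to get from the cluster labels back to box plots is sketched below. It relies on kmeans.labels_ being in the same row order as the data passed to fit, so the labels can be attached to the original rows as a new column. This is only a sketch under assumptions: synthetic data stands in for the CSV, the column names f0..f7 and the helpers flag_outliers and plot_cluster_boxes are hypothetical, and the 1.5 * IQR rule is simply the usual box-plot whisker convention, not something taken from the post.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


def flag_outliers(df, column, by="cluster"):
    """Return a boolean mask: True where a value lies outside the
    1.5 * IQR whiskers of its own cluster (hypothetical helper)."""
    grouped = df.groupby(by)[column]
    # Per-cluster quartiles, mapped back onto every row of that cluster
    q1 = df[by].map(grouped.quantile(0.25))
    q3 = df[by].map(grouped.quantile(0.75))
    iqr = q3 - q1
    return (df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)


def plot_cluster_boxes(df, column, by="cluster"):
    """Draw one box per cluster for a single original feature."""
    df.boxplot(column=column, by=by)
    plt.suptitle("")  # drop the automatic 'Boxplot grouped by ...' title
    plt.title(f"{column} by cluster")


# Synthetic stand-in for the CSV (assumption: 200 rows, 8 numeric columns)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 8)),
                 columns=[f"f{i}" for i in range(8)])

# Same pipeline as in the post: normalise, PCA, K-means
X_norm = MinMaxScaler().fit_transform(X)
X_pca = PCA(n_components=5).fit_transform(X_norm)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_pca)

# Attach the labels to the original, unscaled rows
labelled = X.copy()
labelled["cluster"] = kmeans.labels_

# Rows of feature f0 that fall outside the whiskers of their cluster
mask = flag_outliers(labelled, "f0")
print(labelled.loc[mask, ["f0", "cluster"]])

if __name__ == "__main__":
    plot_cluster_boxes(labelled, "f0")
    plt.show()
```

The index of the printed rows is the row number in the original data, which is what lets you trace a point beyond the whiskers back to a specific instance. The same call can be repeated for each column of interest.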