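So apparently the means_ attribute returns results that differ from the means I calculated for each cluster. (Or I have misunderstood what it returns!)
Below is the code I wrote to check how well a GMM fits the time series data I have.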
import numpy as np
import pandas as pd
import seaborn as sns
import time
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
from sklearn.mixture import GaussianMixture
toc = time.time()
input contains a matrix of shape (number of meters/samples) x (number of features).
read = pd.read_csv('input', sep='\t', index_col=0, header=0,
                   names=['meter', '6:30', '9:00', '15:30', '22:30',
                          'std_year', 'week_score', 'season_score'],
                   encoding='utf-8')
read.drop('meter', axis=1, inplace=True)
read['std_year'] = read['std_year'].divide(4).round(2)
# Feature matrix of the four time-of-day readings, shape (n_samples, 4)
input = read[['6:30', '9:00', '15:30', '22:30']].to_numpy()
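I fit a GMM with 10 clusters to it. (According to the BIC plot, 5 was the best number, with the lowest score, but that score was around -7,000. After discussing it with my advisor, this is not impossible, but it is still odd.)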
gmm = GaussianMixture(n_components=10, covariance_type ='full', \
init_params = 'random', max_iter = 100, random_state=0)
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)
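For reference, the BIC comparison mentioned above can be sketched roughly as follows. This is a minimal sketch on synthetic data, not my actual script: the array X, the cluster centers and the candidate range are assumptions made up for illustration. It also shows that a negative BIC is not an error in itself, because a continuous density can exceed 1 on tightly clustered data, which makes the log-likelihood positive.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the (samples x 4 features) matrix; not the real meter data.
rng = np.random.default_rng(0)
centers = np.array([[0.5, 0.8, 1.0, 1.5],
                    [0.9, 0.7, 0.9, 1.4],
                    [0.4, 1.4, 1.1, 1.3]])
X = np.vstack([c + 0.05 * rng.standard_normal((200, 4)) for c in centers])

candidate_ks = list(range(1, 8))
bics = []
for k in candidate_ks:
    model = GaussianMixture(n_components=k, covariance_type='full',
                            random_state=0)
    model.fit(X)
    bics.append(model.bic(X))  # lower BIC is better
    print(k, round(bics[-1], 1))

best_k = candidate_ks[int(np.argmin(bics))]
print('lowest BIC at n_components =', best_k)
```

The number of components with the lowest BIC is only a guide; in my case I still went with 10 components in the fit above.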
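What I do below is manually compute the centroid/center of each cluster (if it is correct to use those terms for the mean vector), using the labels returned by .predict.
Specifically, cluster contains values from 0 to 9, each identifying a cluster. I reshape it into a column and concatenate it with the (number of samples) x (number of attributes) input matrix as an array. I wanted to take advantage of how convenient pandas is for data of this size, so I converted the result to a DataFrame.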
cluster = np.array(cluster).reshape(-1,1) #(3488, 1)
ret = np.concatenate((cluster, input), axis=1) #(3488, 5)
ret_pd = pd.DataFrame(ret, columns=['label','6:30', '9:00', '15:30', '22:30'])
ret_pd['label'] = ret_pd['label'].astype(int)
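For each meter's features, the assigned cluster is stored in the 'label' column. The following code therefore selects the rows of each label, and then I take the mean by columns.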
cluster_mean = []
for label in range(10):
    # take mean by columns per each cluster
    segment = ret_pd[ret_pd['label'] == label]
    print(segment)
    turn = np.array(segment)[:, 1:]
    print(turn.shape)
    mean_ = np.mean(turn, axis=0).round(2)  # 1-D array of column means
    print(mean_)
    plt.plot(np.array(mean_), label='cluster %s' % label)
    cluster_mean.append(list(mean_))
print(cluster_mean)
xvalue = ['6:30', '9:00', '15:30', '22:30']
plt.ylabel('Energy Use [kWh]')
plt.xlabel('time of day')
plt.xticks(range(4), xvalue)
plt.legend(loc = 'upper center', bbox_to_anchor = (0.5, 1.05),\
ncol =2, fancybox =True, shadow= True)
plt.savefig('cluster_gmm_100.png')
tic = time.time()
print('time ', tic-toc)
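The per-label column means computed by the loop above can also be obtained with a pandas groupby. The following is a minimal sketch on toy values: X and labels are hypothetical stand-ins for the input matrix and the output of gmm.predict, not my real data.

```python
import numpy as np
import pandas as pd

cols = ['6:30', '9:00', '15:30', '22:30']

# Toy values standing in for the real meter readings.
X = np.array([[0.4, 1.0, 1.2, 1.4],
              [0.6, 1.2, 1.0, 1.6],
              [0.8, 0.6, 0.8, 1.2],
              [1.0, 0.8, 1.0, 1.4]])
labels = np.array([0, 0, 1, 1])  # hypothetical output of gmm.predict(X)

frame = pd.DataFrame(X, columns=cols)
frame['label'] = labels  # labels stay integers; no float cast as with np.concatenate

# One row per cluster, one column per time of day.
cluster_mean = frame.groupby('label')[cols].mean().round(2)
print(cluster_mean)
```

This should give the same hard-assignment means as the loop, just without the intermediate array conversions.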
Interestingly, the .means_ values returned by the library differ from the ones I calculated in this code.
Scikit-learn's .means_:
[[ 0.46 1.42 1.12 1.35]
[ 0.49 0.78 1.19 1.49]
[ 0.49 0.82 1.01 1.63]
[ 0.6 0.77 0.99 1.55]
[ 0.78 0.75 0.92 1.42]
[ 0.58 0.68 1.03 1.57]
[ 0.4 0.96 1.25 1.47]
[ 0.69 0.83 0.98 1.43]
[ 0.55 0.96 1.03 1.5 ]
[ 0.58 1.01 1.01 1.47]]
My results:
[[0.45000000000000001, 1.6599999999999999, 1.1100000000000001, 1.29],
[0.46000000000000002, 0.73999999999999999, 1.26, 1.48],
[0.45000000000000001, 0.80000000000000004, 0.92000000000000004, 1.78],
[0.68000000000000005, 0.72999999999999998, 0.85999999999999999, 1.5900000000000001],
[0.91000000000000003, 0.68000000000000005, 0.84999999999999998, 1.3600000000000001],
[0.58999999999999997, 0.65000000000000002, 1.02, 1.5900000000000001],
[0.35999999999999999, 1.03, 1.28, 1.46],
[0.77000000000000002, 0.88, 0.94999999999999996, 1.3500000000000001],
[0.53000000000000003, 1.0700000000000001, 0.97999999999999998, 1.53],
[0.66000000000000003, 1.21, 0.95999999999999996, 1.3600000000000001]]
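One possible reading of the gap between the two tables, offered as an assumption rather than a confirmed diagnosis of my data: a Gaussian mixture estimates its means with soft assignments, where every sample contributes to every component in proportion to its posterior probability, whereas my loop uses the hard labels from .predict. The sketch below compares both on synthetic, overlapping clusters. The data, the two-component setup and the tight tol/max_iter settings are illustrative choices, not the settings of my run.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping synthetic clusters; not the real meter data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)),
               rng.normal(2.0, 1.0, size=(1000, 2))])

# Tight tolerance so that EM runs close to convergence.
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      tol=1e-10, max_iter=5000, random_state=0).fit(X)

# Hard assignment: each sample counts fully towards its most likely component.
labels = gmm.predict(X)
hard_means = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Soft assignment: each sample is weighted by its responsibility
# (posterior probability) for each component.
resp = gmm.predict_proba(X)  # shape (n_samples, 2)
soft_means = (resp.T @ X) / resp.sum(axis=0)[:, np.newaxis]

hard_gap = np.abs(hard_means - gmm.means_).max()
soft_gap = np.abs(soft_means - gmm.means_).max()
print('means_:\n', gmm.means_.round(2))
print('largest gap, hard-label means vs means_:', hard_gap)
print('largest gap, responsibility-weighted means vs means_:', soft_gap)
```

If this is what happens with my data too, the more the clusters overlap, the further the hard-label means would drift from .means_. A fit that stops at max_iter before converging could add to the difference.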
As an aside, I am not sure why my results were not properly rounded to 2 decimal places.
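A likely explanation for the rounding, assuming my environment prints floats with 17 significant digits (as Python 2 and NumPy releases before 1.14 did for float64): the values are rounded, but 0.45 or 1.66 cannot be stored exactly in binary, and printing a list shows the full stored approximation. The sketch below reproduces that representation explicitly; the sample values are copied from the first row of my results.

```python
import numpy as np

# Values from the first row of the hand-computed means, rounded to 2 decimals.
rounded = np.array([0.45, 1.66, 1.11, 1.29]).round(2)

# 17 significant digits expose the binary approximation behind each value.
full_digits = [format(float(v), '.17g') for v in rounded]
print(full_digits)

# Formatting to two decimals when printing hides the approximation.
two_decimals = ['%.2f' % v for v in rounded]
print(two_decimals)
```

Under that assumption, the numbers themselves are fine and only the display differs from the array printed by gmm.means_.round(2).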