r - Clustering time series in R: is K-means accurate?

Question

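My dataset consists of measurements of the same index over 14 years (columns) for 105 countries (rows). I want to cluster the countries according to the trend of the index over time.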

I am experimenting with hierarchical clustering (hclust) and K-medoids (pam), both applied to a DTW distance matrix (dtw package).
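As a rough illustration of the workflow described above (not the asker's R code), the sketch below computes DTW distances in plain Python and passes them to a naive average-linkage agglomerative clustering. The function names `dtw_distance`, `pairwise_dtw`, and `average_linkage`, as well as the four toy series, are assumptions made for this example; in R, the `dtw` package and `hclust` play these roles.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pairwise_dtw(series):
    """Symmetric matrix of DTW distances between all pairs of series."""
    n = len(series)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])
    return dist


def average_linkage(dist, k):
    """Naive agglomerative clustering: merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data (made up): two rising and two falling series, one of each lagged by a step.
series = [
    [1, 2, 3, 4, 5, 6],
    [1, 1, 2, 3, 4, 5],
    [6, 5, 4, 3, 2, 1],
    [6, 6, 5, 4, 3, 2],
]
dist = pairwise_dtw(series)
clusters = average_linkage(dist, k=2)
print(clusters)
```

Because DTW is allowed to warp the time axis, the lagged copy of each trend ends up very close to its unlagged counterpart, so the two rising series and the two falling series are grouped together.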

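I also tried K-means, passing the DTW distance matrix as the first argument of the kmeans function. The algorithm runs, but I am not sure the result is valid, because K-means uses Euclidean distance and computes centroids as means.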
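One way to see why this is questionable: K-means treats each row of whatever matrix it receives as a feature vector, so with a distance matrix as input it compares rows of distances by Euclidean distance. The minimal Python sketch below uses three made-up points on a line (the helper name `row_euclidean` is hypothetical) to show that these row-to-row distances are not the original dissimilarities.

```python
import math

# Three made-up observations on a line and their pairwise distance matrix.
points = [0.0, 1.0, 10.0]
D = [[abs(p - q) for q in points] for p in points]


def row_euclidean(matrix, i, j):
    """Euclidean distance between rows i and j, i.e. what K-means would use."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(matrix[i], matrix[j])))


for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"pair ({i}, {j}): original = {D[i][j]}, "
          f"row-based = {row_euclidean(D, i, j):.3f}")
```

The partition may still look plausible, since observations with similar distance profiles tend to be similar, but the quantity being minimized is no longer the DTW dissimilarity itself.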

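I am also considering running K-means directly on the data, but I cannot see how the result could be accurate: the algorithm would treat the measurements of the same variable at different times as different variables, both when computing the centroids at each iteration and when computing the Euclidean distances that assign observations to clusters. It seems to me that this procedure cannot cluster time series as well as hierarchical and K-medoids clustering can.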
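The concern can be made concrete with a small check. Euclidean distance does not change when the time points of all series are reordered in the same way, so a method built only on it cannot make use of temporal order. The Python sketch below uses two made-up five-point series; the names `euclidean`, `series_a`, and `series_b` are assumptions for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up series: one rising, one falling.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [5.0, 4.0, 3.0, 2.0, 1.0]

# Apply the same random reordering of time points to both series.
random.seed(0)
order = list(range(len(series_a)))
random.shuffle(order)
shuffled_a = [series_a[t] for t in order]
shuffled_b = [series_b[t] for t in order]

d_original = euclidean(series_a, series_b)
d_shuffled = euclidean(shuffled_a, shuffled_b)
print(math.isclose(d_original, d_shuffled))
```

Since the component-wise mean is permuted in the same way, K-means on the raw columns would produce the same partition for the scrambled data as for the original data. An elastic measure such as DTW depends on the ordering.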

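Is K-means a good choice for clustering time series, or is it better to use algorithms built on a distance such as DTW, even though they are slower? Is there an R function or a dedicated package that lets K-means work with a distance matrix for clustering time series data?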
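For context on the distance-matrix part of the question: K-means needs coordinates in order to average them, whereas K-medoids restricts each center to an actual observation and therefore needs only pairwise dissimilarities. The sketch below is a simplified alternating K-medoids in plain Python, not the full PAM swap procedure that R's `pam` implements. The function names, the naive "first k observations" initialization, and the toy distance matrix are all assumptions for illustration.

```python
def assign(dist, medoids):
    """Label each observation with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternate between assignment and medoid update on a precomputed matrix."""
    medoids = list(range(k))  # naive deterministic start (assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if a cluster lost all its members.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy precomputed dissimilarities (made up); a DTW matrix could be used instead.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in values] for p in values]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

The function never looks at the raw observations, only at `dist`, so any dissimilarity (including DTW) can be plugged in without changing the algorithm.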

score 0 · Accepted Answer

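KMeans will do exactly what you tell it to do. Unfortunately, feeding a time series dataset into the KMeans algorithm tends to give meaningless results. KMeans, like most general-purpose clustering methods, is built around Euclidean distance, which is usually not a good measure for time series data. Put simply, K-means generally performs poorly when clusters are not roughly spherical, because every point is assigned according to its distance from a cluster center. Take a look at the GMM (Gaussian mixture model) algorithm as an alternative. It sounds like you are doing this experiment in R; if so, see the examples below.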

[Figure: a KMeans clustering]

[Figure: a GMM clustering]

Which one looks more like your time series plot?

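I searched Google for a good R code sample that demonstrates how GMM clustering works, but I could not find anything decent. Personally, I use Python far more than R. If you are open to a Python solution, see the sample code below.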

import numpy as np
import itertools

from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn import mixture

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

lowest_bic = np.inf
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

# Plot the winner
splot = plt.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, cov, color) in enumerate(zip(clf.means_, clf.covariances_,
                                           color_iter)):
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.5)
    splot.add_artist(ell)

plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()

Finally, from the figure below, you can clearly see how

score 0 · Accepted Answer

Here is an example of how to visualize the clusters using plotGMM. The code to reproduce it is below:

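library(quantmod)  # getSymbols(), Cl()
library(ggplot2)   # fortify() generic
library(mixtools)  # normalmixEM()
library(plotGMM)   # plot_GMM()

SCHB <- fortify(getSymbols('SCHB', auto.assign = FALSE))
set.seed(730) # for reproducibility
mixmdl <- normalmixEM(Cl(SCHB), k = 5) # fit a 5-component Gaussian mixture
plot_GMM(mixmdl, k = 5)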

I hope this helps. Also, to plot a time series with ggplot2, you should use ggplot2's fortify function.
