
I'm using sklearn.mixture.GMM in Python, and the results seem to depend on the scaling of the data. In the following code example I change the overall scale but not the relative scale of the dimensions. Yet under the three different scaling settings I get completely different results:

from sklearn.mixture import GMM
from numpy import array, shape
from numpy.random import randn
from random import choice

# centroids will be normally-distributed around zero:
truelumps = randn(20, 5) * 10

# data randomly sampled from the centroids:
data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])

for scaler in [0.01, 1, 100]:
    scdata = data * scaler
    thegmm = GMM(n_components=10)
    thegmm.fit(scdata, n_iter=1000)
    ll = thegmm.score(scdata)
    print sum(ll)

Here is the output I get:

GMM(cvtype='diag', n_components=10)
7094.87886779
GMM(cvtype='diag', n_components=10)
-14681.566456
GMM(cvtype='diag', n_components=10)
-37576.4496656

In principle, I would have thought the overall scaling of the data shouldn't matter and the total log-likelihood should be similar each time. But perhaps I'm overlooking an implementation issue?


2 Answers


I got an answer via the scikit-learn mailing list: in my code example, the log-likelihood should indeed vary with the scale (because we're evaluating point likelihoods, not integrals), by a term proportional to log(scale). So I think my code example in fact shows GMM giving correct results.
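To make that concrete: scaling every coordinate by a factor s divides each point's density by s**d, so the total log-likelihood of N points shifts by N * d * log(s). A quick back-of-the-envelope check against the totals printed above (N = 1000, d = 5; nothing here beyond that arithmetic):

from math import log

N, d = 1000, 5                      # samples and dimensions in the example above

# Going from scaler=1 to scaler=100 should lower the total log-likelihood
# by about N * d * log(100):
expected_drop = N * d * log(100)
print(expected_drop)                # ~= 23026

# Observed drop in the output above: -14681.57 - (-37576.45) ~= 22895,
# which agrees with the prediction up to fitting noise.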

answered 2012-10-31T18:28:36.543

I think GMM is scale-dependent (like k-means, for example), so it's recommended to standardize the inputs as described in the preprocessing chapter of the documentation.
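For illustration, a minimal sketch of that standardization step; it uses StandardScaler and GaussianMixture from current scikit-learn (the modern successor to the GMM class discussed here), so the class names are not the ones the 2012 docs would have shown:

import numpy as np
from sklearn.preprocessing import StandardScaler   # current scikit-learn API
from sklearn.mixture import GaussianMixture        # modern successor to the old GMM class

data = np.random.randn(1000, 5) * 10                # stand-in for the question's `data`

# Rescale each feature to zero mean and unit variance so the fit
# no longer depends on the overall units of the input.
scaled = StandardScaler().fit_transform(data)

gmm = GaussianMixture(n_components=10).fit(scaled)
print(gmm.score(scaled))                            # mean per-sample log-likelihood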

answered 2012-10-31T16:27:49.790