python - Fitting weighted data with a Gaussian mixture model (GMM) with a minimum covariance in Python

Question

I want to use Python to fit a Gaussian mixture model to a set of weighted data points.

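I tried sklearn.mixture.GMM(), which works fine except that it weights all data points equally. Does anyone know a way to assign weights to the data points with this method? I tried including data points several times to "increase their weight", but this seems ineffective for large datasets.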
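For integer weights, the repetition workaround amounts to the NumPy sketch below. The data and variable names are made up for illustration and are not from the original post. It shows that the replicated data reproduces the weighted mean and covariance, and also why the approach scales poorly: the number of rows grows with the sum of the weights, and non-integer weights cannot be expressed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # five 2-D points (illustrative data)
w = np.array([1, 3, 2, 1, 4])      # integer weights (illustrative)

# Replication: repeat row i exactly w[i] times.
X_rep = np.repeat(X, w, axis=0)

# Unweighted statistics of the replicated data ...
mean_rep = X_rep.mean(axis=0)
cov_rep = np.cov(X_rep.T, bias=True)

# ... match the weighted statistics of the original data.
mean_w = np.average(X, axis=0, weights=w)
cov_w = np.cov(X.T, aweights=w, bias=True)

print(np.allclose(mean_rep, mean_w), np.allclose(cov_rep, cov_w))
print(X_rep.shape[0], "rows instead of", X.shape[0])
```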

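I also considered implementing the EM algorithm myself, but that seems much slower than the GMM method above and would greatly increase the computation time for large datasets.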

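I just found the OpenCV implementation of the EM algorithm, cv2.EM(). It also works fine, but it has the same problem as sklearn.mixture.GMM, and in addition there seems to be no way to change the minimum value allowed for the covariance. Or is there a way to change the covariance minimum to, for example, 0.001? I had hoped the probs parameter could be used to assign weights to the data, but it seems to be only an output parameter with no influence on the fitting process, is that right? Using probs0 and starting the algorithm with the M-step via trainM did not help either. For probs0 I used a (number of data points) x (number of GMM components) matrix whose columns are identical, with each data point's weight written into the row corresponding to that point. This did not solve the problem either.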

Does anyone know how to adapt the methods above, or know of another method, so that a GMM can be fitted to weighted data?

score 1 · Accepted Answer

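In case you are still looking for a solution, pomegranate now supports training GMMs on weighted data. All you need to do is pass a weight vector at training time and it will handle the rest for you. Here is a short tutorial on GMMs in pomegranate!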

The parent GitHub repository is here:

https://github.com/jmschrei/pomegranate

The tutorial itself is here:

https://github.com/jmschrei/pomegranate/blob/master/tutorials/B_Model_Tutorial_2_General_Mixture_Models.ipynb

score 1 · Accepted Answer

Following Jacob's suggestion, I wrote an example pomegranate implementation:

import pomegranate
import numpy
import sklearn
import sklearn.datasets 

#-------------------------------------------------------------------------------
#Get data from somewhere (moons data is nice for examples)
Xmoon, ymoon = sklearn.datasets.make_moons(200, shuffle = False, noise=.05, random_state=0)
Moon1 = Xmoon[:100] 
Moon2 = Xmoon[100:] 
MoonsDataSet = Xmoon

#Weight the data from moon2 much higher than moon1:
MoonWeights = numpy.array([numpy.ones(100), numpy.ones(100)*10]).flatten()

#Make the GMM model using pomegranate
model = pomegranate.gmm.GeneralMixtureModel.from_samples(
    pomegranate.MultivariateGaussianDistribution,   #Either a single distribution class, or a list of them
    n_components=6,     #Required if a single distribution class is passed as first arg
    X=MoonsDataSet,     #data format: each row is a point-coordinate, each column is a dimension
    )

#Force the model to train again, using additional fitting parameters
model.fit(
    X=MoonsDataSet,         #data format: each row is a coordinate, each column is a dimension
    weights = MoonWeights,  #List of weights. One for each point-coordinate
    stop_threshold = .001,  #Lower this value to get better fit but take longer.
                            #   (sklearn likes better/slower fits than pomegranate by default)
    )

#Wrap the model object into a probability density python function 
#   f(x_vector)
def GaussianMixtureModelFunction(Point):
    return model.probability(numpy.atleast_2d( numpy.array(Point) ))

#Plug in a single point to the mixture model and get back a value:
ExampleProbability = GaussianMixtureModelFunction( numpy.array([ 0,0 ]) )
print ('ExampleProbability', ExampleProbability)
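
If pomegranate is not an option, the weights can also be folded into a hand-written EM loop, as the question considers. The following is a minimal NumPy sketch, not pomegranate's or sklearn's implementation. The function name weighted_gmm_em and the parameters min_covar, n_iter and seed are hypothetical names chosen for this illustration. Each point's weight multiplies its responsibilities in the M-step, and min_covar is added to the diagonal of every covariance matrix, which keeps the variances from collapsing below that value.

```python
import numpy as np

def weighted_gmm_em(X, w, k, n_iter=100, min_covar=1e-3, seed=0):
    """Fit a k-component full-covariance GMM to the rows of X with point weights w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.asarray(w, dtype=float)
    # Initialise means from weighted random picks, covariances from the
    # weighted covariance of the whole data set.
    means = X[rng.choice(n, size=k, replace=False, p=w / w.sum())]
    covs = np.tile(np.cov(X.T, aweights=w) + min_covar * np.eye(d), (k, 1, 1))
    pis = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: log of (mixing weight * Gaussian density) for each component.
        log_r = np.empty((n, k))
        for j in range(k):
            L = np.linalg.cholesky(covs[j])
            z = np.linalg.solve(L, (X - means[j]).T)      # shape (d, n)
            log_det = 2.0 * np.log(np.diag(L)).sum()
            log_r[:, j] = np.log(pis[j]) - 0.5 * (
                d * np.log(2.0 * np.pi) + log_det + (z ** 2).sum(axis=0)
            )
        log_r -= log_r.max(axis=1, keepdims=True)         # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the data weights scale the responsibilities.
        wr = r * w[:, None]
        nk = wr.sum(axis=0) + 1e-12                       # guard against empty components
        pis = nk / nk.sum()
        means = (wr.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (wr[:, j, None] * diff).T @ diff / nk[j] + min_covar * np.eye(d)
    return pis, means, covs

# Illustrative data: two blobs, the second weighted ten times higher.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])
w = np.concatenate([np.ones(100), 10.0 * np.ones(100)])
pis, means, covs = weighted_gmm_em(X, w, k=2)
print(pis, means, sep="\n")
```

The per-component Python loop keeps the sketch readable. For large datasets the E-step would need to be vectorised, and the fixed iteration count could be replaced by a log-likelihood convergence check.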
