openturns - 如何在 OpenTURNS 中估计混合模型的参数？

Question

我想估计OpenTURNS中正态分布的混合模型的参数（即高斯随机变量的加权和的分布）。OpenTURNS 可以创建这样的混合，但它不能估计它的参数。此外，我需要将混合物创建为 OpenTURNS 分布，以便通过函数传播不确定性。

例如，我知道如何创建两个正态分布的混合：

import openturns as ot
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)

在这个例子中，我想在给定的样本上估计mu1, sigma1, mu2, 。sigma2为了创建一个工作示例，很容易通过模拟生成一个示例。

s = m.getSample(100)

score 2 · Accepted Answer

您可以依靠 scikit-learnGaussianMixture来估计参数，然后使用它们在 OpenTURNS 中定义 Mixture 模型。

此后的脚本包含一个 Python 类，该类MixtureFactory估计 a 的参数scikitlearn GaussianMixture并输出 OpenTURNSMixture分布：

from sklearn.mixture import GaussianMixture
from sklearn.utils.validation import check_is_fitted
import openturns as ot
import numpy as np

class MixtureFactory(GaussianMixture):
    """
    Representation of a Gaussian mixture model probability distribution.

    This class allows to estimate the parameters of a Gaussian mixture
    distribution using scikit algorithms & provides openturns Mixture object.
    
    Read more in scikit learn user guide & openturns theory.

    Parameters:
    -----------
    n_components : int, defaults to 1.
        The number of mixture components.

    covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}
        String describing the type of covariance parameters to use.
        Must be one of:

        'full'
            each component has its own general covariance matrix
        'tied'
            all components share the same general covariance matrix
        'diag'
            each component has its own diagonal covariance matrix
        'spherical'
            each component has its own single variance

    tol : float, defaults to 1e-3.
        The convergence threshold. EM iterations will stop when the
        lower bound average gain is below this threshold.

    reg_covar : float, defaults to 1e-6.
        Non-negative regularization added to the diagonal of covariance.
        Allows to assure that the covariance matrices are all positive.

    max_iter : int, defaults to 100.
        The number of EM iterations to perform.

    n_init : int, defaults to 1.
        The number of initializations to perform. The best results are kept.

    init_params : {'kmeans', 'random'}, defaults to 'kmeans'.
        The method used to initialize the weights, the means and the
        precisions.
        Must be one of::

            'kmeans' : responsibilities are initialized using kmeans.
            'random' : responsibilities are initialized randomly.

    weights_init : array-like, shape (n_components, ), optional
        The user-provided initial weights, defaults to None.
        If it None, weights are initialized using the `init_params` method.

    means_init : array-like, shape (n_components, n_features), optional
        The user-provided initial means, defaults to None,
        If it None, means are initialized using the `init_params` method.

    precisions_init : array-like, optional.
        The user-provided initial precisions (inverse of the covariance
        matrices), defaults to None.
        If it None, precisions are initialized using the 'init_params' method.
        The shape depends on 'covariance_type'::

            (n_components,)                        if 'spherical',
            (n_features, n_features)               if 'tied',
            (n_components, n_features)             if 'diag',
            (n_components, n_features, n_features) if 'full'

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    warm_start : bool, default to False.
        If 'warm_start' is True, the solution of the last fitting is used as
        initialization for the next call of fit(). This can speed up
        convergence when fit is called several times on similar problems.
        In that case, 'n_init' is ignored and only a single initialization
        occurs upon the first call.
        See :term:`the Glossary <warm_start>`.

    verbose : int, default to 0.
        Enable verbose output. If 1 then it prints the current
        initialization and each iteration step. If greater than 1 then
        it prints also the log probability and the time needed
        for each step.

    verbose_interval : int, default to 10.
        Number of iteration done before the next print.
    """
    def __init__(self, n_components=2, covariance_type='full', tol=1e-6,
                 reg_covar=1e-6, max_iter=1000, n_init=1, init_params='kmeans',
                 weights_init=None, means_init=None, precisions_init=None,
                 random_state=41, warm_start=False,
                 verbose=0, verbose_interval=10):
        super().__init__(n_components, covariance_type, tol, reg_covar,
                         max_iter, n_init, init_params, weights_init, means_init,
                         precisions_init, random_state, warm_start, verbose, verbose_interval)
    def fit(self, X):
        """
        Fit the mixture model parameters.

        EM algorithm is applied here to estimate the model parameters and build a
        Mixture distribution (see openturns mixture). 
        The method fits the model ``n_init`` times and sets the parameters with
        which the model has the largest likelihood or lower bound. Within each
        trial, the method iterates between E-step and M-step for ``max_iter``
        times until the change of likelihood or lower bound is less than
        ``tol``, otherwise, a ``ConvergenceWarning`` is raised.
        If ``warm_start`` is ``True``, then ``n_init`` is ignored and a single
        initialization is performed upon the first call. Upon consecutive
        calls, training starts where it left off.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            List of n_features-dimensional data points. Each row
            corresponds to a single data point.

        Returns
        -------
        """
        data = np.array(X)
        # Evaluate the model parameters.
        super().fit(data)
        # openturns mixture
    
        # n_components ==> weight of size n_components
        weights = self.weights_
        n_components = len(weights)
        # Create ot distribution
        collection = n_components * [0]
        # Covariance matrices
        cov = self.covariances_
        mu = self.means_
        # means : n_components x n_features
        n_components, n_features = mu.shape

        # Following the type of covariance, we define the collection of gaussians

        # Spherical : C_k = Identity * sigma_k
        if self.covariance_type is 'spherical':
            c = ot.CorrelationMatrix(n_features)
            for l in range(n_components):
                sigma = np.sqrt(cov[l])
                collection[l] = ot.Normal(list(mu[l]), [ sigma ] * n_features  , c)

        elif self.covariance_type is 'diag' :
            for l in range(n_components):
                c = ot.CovarianceMatrix(n_features)
                for i in range(n_features):
                    c[i,i] = cov[l, i]
                collection[l] = ot.Normal(list(mu[l]), c)

        elif self.covariance_type == 'tied':
            # Same covariance for all clusters
            c = ot.CovarianceMatrix(n_features)
            for i in range(n_features):
                for j in range(0, i+1):
                    c[i,j] = cov[i,j]
            # Define the collection with the same covariance
            for l in range(n_components):
                collection[l] = ot.Normal(list(mu[l]), c)
            
        else:
            n_features = cov.shape[1]
            for l in range(n_components):
                c = ot.CovarianceMatrix(n_features)
                for i in range(n_features):
                    for j in range(0, i+1):
                        c[i,j] = cov[l][i,j]
                collection[l] = ot.Normal(list(mu[l]), c)

        self._mixture = ot.Mixture(collection, weights)
        return self

    def get_mixture(self):
        """
        Returns the Mixture object
        """
        check_is_fitted(self)
        return self._mixture

if __name__ == "__main__":
    mu1 = 1.0
    sigma1 = 0.5
    mu2 = 3.0
    sigma2 = 2.0
    weights = [0.3, 0.7]
    n1 = ot.Normal(mu1, sigma1)
    n2 = ot.Normal(mu2, sigma2)
    m = ot.Mixture([n1, n2], weights)
    x = m.getSample(1000)
    est_dist = MixtureFactory(random_state=1)
    est_dist.fit(x)
    print(est_dist.get_mixture())

score 1 · Accepted Answer

我实际上已经尝试过这种方法，并且效果很好。最重要的是，通过 SciKit GMM 拟合模型以及通过 OpenTurns 进行的后期调整非常快。我建议未来的用户测试多个分量和协方差矩阵结构，因为它不会花费很多时间，并且可以显着提高对数据的拟合优度。

感谢你的回答。

score 0 · Accepted Answer

下面是一个替代方案。第一步创建一个新的 GaussianMixture 类，派生自 PythonDistribution。关键是实现computeLogPDF方法和set/getParameters方法。请注意，混合物的这种参数化只有一个权重 w。

class GaussianMixture(ot.PythonDistribution):
    def __init__(self, mu1 = -5.0, sigma1 = 1.0, \
                 mu2 = 5.0, sigma2 = 1.0, \
                 w = 0.5):
        super(GaussianMixture, self).__init__(1)
        if w < 0.0 or w > 1.0:
            raise ValueError('The weight is not in [0, 1]. w=%s.' % (w))
        self.mu1 = mu2
        self.sigma1 = sigma1
        self.mu2 = mu2
        self.sigma2 = sigma2
        self.w = w
        collDist = [ot.Normal(mu1, sigma1), ot.Normal(mu2, sigma2)]
        weight = [w, 1.0 - w]
        self.distribution = ot.Mixture(collDist, weight)

    def computeCDF(self, x):
        p = self.distribution.computeCDF(x)
        return p

    def computePDF(self, x):
        p = self.distribution.computePDF(x)
        return p

    def computeQuantile(self, prob, tail = False):
        quantile = self.distribution.computeQuantile(prob, tail)
        return quantile

    def getSample(self, size):
        X = self.distribution.getSample(size)
        return X

    def getParameter(self):
        parameter = ot.Point([self.mu1, self.sigma1, \
                              self.mu2, self.sigma2, \
                              self.w])
        return parameter

    def setParameter(self, parameter):
        [mu1, sigma1, mu2, sigma2, w] = parameter
        self.__init__(mu1, sigma1, mu2, sigma2, w)
        return parameter

    def computeLogPDF(self, sample):
        logpdf = self.distribution.computeLogPDF(sample)
        return logpdf

为了创建分布，我们使用Distribution类：

gm = ot.Distribution(GaussianMixture())

估计这个分布的参数很简单MaximumLikelihoodFactory。但是，我们必须设置边界，因为 sigma 不能为负，并且 w 在 (0, 1) 中。

factory = ot.MaximumLikelihoodFactory(gm)
lowerBound = [0.0, 1.e-6, 0.0, 1.e-6, 0.01]
upperBound = [0.0, 0.0,   0.0, 0.0,   0.99]
finiteLowerBound = [False, True, False, True, True]
finiteUpperBound = [False, False, False, False, True]
bounds = ot.Interval(lowerBound, upperBound, finiteLowerBound, finiteUpperBound)
factory.setOptimizationBounds(bounds)

然后我们配置优化求解器。

solver = factory.getOptimizationAlgorithm()
startingPoint = [-4.0, 1.0, 7.0, 1.5, 0.3]
solver.setStartingPoint(startingPoint)
factory.setOptimizationAlgorithm(solver)

估计参数是基于build方法。

distribution = factory.build(sample)

此实现有两个限制。

首先，由于 PythonDistribution 的限制（请参阅https://github.com/openturns/openturns/issues/1391），它没有达到应有的速度。
估计参数可能很困难，因为问题可能具有无法使用MaximumLikelihoodFactory. 这类任务一般用 EM 算法来完成。

score 0 · Accepted Answer

这是一个纯 OpenTURNS 解决方案。它可能比基于 scikit-learn 的方法慢，但更通用：您可以使用它来估计任何混合模型的参数，不一定是正态分布的混合。

这个想法是从Mixture对象中检索对数似然函数并将其最小化。在下文中，让我们假设这s是我们想要拟合混合物的样本。

首先，我们需要构建我们想要估计参数的混合物。我们可以指定任何有效的参数集，没关系。在您的示例中，您需要 2 个正态分布的混合。

mixture = ot.Mixture([ot.Normal()]*2, [0.5]*2)

有一个小障碍。所有权重总和为 1，因此其中一个由其他权重确定：不得允许求解器自由设置它。OpenTURNS的参数顺序Mixture如下：

第一次分布的权重；
第一分布的参数；
第二次分布的权重；
第二个分布的参数：
...

您可以使用来查看所有参数mixture.getParameter()及其名称mixture.getParameterDescription()。以下是一个辅助函数：

将包含除第一个分布的权重之外的所有混合参数的列表作为输入；
输出 aPoint包含所有参数，包括第一个分布的权重。

def full(params):
    """
    Point of all mixture parameters from a list that omits the first weight.
    
    """
    params = ot.Point(params)
    aux_mixture = ot.Mixture(mixture)
    dist_number = aux_mixture.getDistributionCollection().getSize()

    index = aux_mixture.getDistributionCollection()[0].getParameter().getSize()
    list_weights = []
    for num in range(1, dist_number):
        list_weights.append(params[index])
        index += 1 + aux_mixture.getDistributionCollection()[num].getParameter().getSize()

    complementary_weight = ot.Point([abs(1.0 - sum(list_weights))])
    complementary_weight.add(params)
    return complementary_weight

下一个函数计算给定参数列表的对数似然的反函数（第一个权重除外）。为了数值稳定性，它将这个值除以观察次数。

我们将最小化此函数以找到最大似然估计。

def minus_log_pdf(params):
    """
    - log-likelihood of a list of parameters excepting the first weight
    divided by the number of observations
    """
    aux_mixture = ot.Mixture(mixture)

    full_params = full(params)

    try:
        aux_mixture.setParameter(full_params)
    except TypeError:
        # case where the proposed parameters are invalid:
        # return a huge value
        return [ot.SpecFunc.LogMaxScalar]
    res = - aux_mixture.computeLogPDF(s).computeMean()
    return res

为了使用 OpenTURNS 优化工具，我们需要把这个函数变成一个PythonFunction对象。

OT_minus_log_pdf = ot.PythonFunction(mixture.getParameter().getSize()-1, 1, minus_log_pdf)

Cobyla 通常擅长似然优化。

problem = ot.OptimizationProblem(OT_minus_log_pdf)
algo = ot.Cobyla(problem)

为了减少 Cobyla 陷入局部最小值的机会，我们将使用MultiStart. 我们选择一组起始参数并随机更改权重。下面的辅助函数使它变得容易：

def random_weights(params, nb):
    """
    List of nb Points representing mixture parameters with randomly varying weights.
    """
    aux_mixture = ot.Mixture(mixture)
    full_params = full(params)
    aux_mixture.setParameter(full_params)
    list_params = []
    for num in range(nb):
        dirichlet = ot.Dirichlet([1.0] * aux_mixture.getDistributionCollection().getSize()).getRealization()
        dirichlet.add(1.0 - sum(dirichlet))
        aux_mixture.setWeights(dirichlet)
        list_params.append(aux_mixture.getParameter()[1:])
    return list_params

我们选择 10 个起点，并将对数似然的最大评估次数从 100（默认）增加到 10000。

init = mixture.getParameter()[1:]
starting_points = random_weights(init, 10)
algo_multistart = ot.MultiStart(algo, starting_points)
algo_multistart.setMaximumEvaluationNumber(10000)

让我们运行求解器并检索结果。

algo_multistart.run()
result = algo_multistart.getResult()

剩下的就是将mixture's 参数设置为最佳值。我们一定不要忘记把第一个重量加回来！

optimal_parameters = result.getOptimalPoint()
mixture.setParameter(full(optimal_parameters))

openturns - 如何在 OpenTURNS 中估计混合模型的参数？

4 回答 4

Related

Reference