python - scipy.optimize + kmeans clustering

Question

The kmeans clustering algorithm I implemented for a project has the following setup:

import numpy as np 
import scipy
import sys
import random
import matplotlib.pyplot as plt
import operator
class KMeansClass:
    #takes in an npArray like object
    def __init__(self,dataset,k):
        self.dataset=np.array(dataset)
        #initialize mins to +infinity so that any data value is smaller
        self.min_x = np.inf
        self.min_y = np.inf
        #initialize maxs to -infinity so that any data value is larger
        self.max_x = -np.inf
        self.max_y = -np.inf
        self.k = k

        #a is the coefficient matrix that is continually updated as the centroids of the clusters change respectively.
        # It is an mxk matrix where each row corresponds to a training_instance and each column corresponds to a centroid of a cluster
        #Values are either 0 or 1. A value for a particular training_instance (data_point) is 1 only for that centroid to which the training_instance
        # has the least distance else the value is 0.
        self.a = np.zeros(shape=[self.dataset.shape[0],self.k])
        self.distanceMatrix = np.empty(shape =[self.dataset.shape[0],self.k])


        #initialize mu to zeros of the requisite shape (k centroids, 2 coordinates each); initializeCentroids() fills it in below.
        self.mu = np.zeros(shape=[k,2])


        self.findMinMaxdataPoints()
        self.initializeCentroids()
        self.createDistanceMatrix()
        self.scatterPlotOfInitializedPoints()


    #pointa and pointb are npArray like vectors.
    def euclideanDistance(self,pointa,pointb):
        return  np.sqrt(np.sum((pointa - pointb)**2))

    """ Problem Initialization And Visualization Helper methods"""
    ##############################################################################
    #@param: dataset : list of tuples [(x1,y1),(x2,y2),...(xm,ym)]
    def findMinMaxdataPoints(self):
        for item in self.dataset:
            self.min_x = min(self.min_x,item[0])
            self.min_y = min(self.min_y,item[1])
            self.max_x = max(self.max_x,item[0])
            self.max_y = max(self.max_y,item[1])



    def initializeCentroids(self):
        for i in range(self.k):
            #each row of mu is a random point with x in [min_x, max_x] and y in [min_y, max_y]
            #random.uniform is used so that non-integer data ranges also work
            self.mu[i] = (random.uniform(self.min_x,self.max_x),random.uniform(self.min_y,self.max_y))
        #sort once, after every centroid has been assigned
        self.sortCentroids()

        print(self.mu)

    def sortCentroids(self):

        #the following 3 lines of code are to ensure that the mu values are always sorted in ascending order first with respect to the
        #x values and then with respect to the y values.
        half_sorted = sorted(self.mu,key=operator.itemgetter(1))   #sort wrt y values
        full_sorted = sorted(half_sorted,key=operator.itemgetter(0)) #sort the y-sorted array wrt x-values
        self.mu = np.array(full_sorted)

    def scatterPlotOfInitializedPoints(self):
        plt.scatter([item[0] for item in self.dataset],[item[1] for item in self.dataset],color='b')
        plt.scatter([item[0] for item in self.mu],[item[1] for item in self.mu],color='r')
        plt.show()

    ###############################################################################

    #minimizing euclidean distance is the same as minimizing the square of the euclidean distance.
    def calcSquareEuclideanDistanceBetweenTwoPoints(self,point_a,point_b):
        return np.sum((point_a-point_b)**2)

    def createDistanceMatrix(self):
        for i in range(self.dataset.shape[0]):
            for j in range(self.k):
                self.distanceMatrix[i,j] = self.calcSquareEuclideanDistanceBetweenTwoPoints(self.dataset[i],self.mu[j])

    def createCoefficientMatrix(self):
        for i in range(self.dataset.shape[0]):
            self.a[i,self.distanceMatrix[i].argmin()] = 1

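    #update functions for CoefficientMatrix and Centroid values:
    def updateCoefficientMatrix(self):
        #clear the previous assignments first, otherwise a row can end up with more than one 1
        self.a[:] = 0
        for i in range(self.dataset.shape[0]):
            self.a[i,self.distanceMatrix[i].argmin()]= 1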

    def updateCentroids(self):
        for j in range(self.k):
            non_zero_indices = np.nonzero(self.a[:,j])[0]
            #leave the centroid unchanged if no point is assigned to cluster j
            if len(non_zero_indices) > 0:
                #new centroid = mean of the data points assigned to cluster j
                self.mu[j] = self.dataset[non_zero_indices].mean(axis=0)

    ############################################################

    def lossFunction(self):
        loss = 0
        for j in range(self.k):
            #vectorized: dot product of column j of a with column j of the distance matrix
            loss += np.dot(self.a[:,j],self.distanceMatrix[:,j])
        return loss

My question concerns lossFunction and how to use it with the scipy.optimize package. I want to minimize the loss function iteratively by performing the following steps:

 Repeat until convergence:
      a> Optimize 'a' by keeping mu constant (I have an updateCoefficientMatrix
         method for updating the 'a' matrix, which is an mxk matrix where we
         have m training instances and k clusters.)
      b> Optimize 'mu' by keeping 'a' constant (I have an updateCentroids
         method to do this, where mu is a mxk matrix wherein m is the number of
         training instances and k is the number of clusters and the number of
         centroids.)

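However, I am very new to the scipy.optimize package, so I am writing to ask for help: how do I call scipy.optimize to achieve the optimization goal above?

Basically I have 2 mxk matrices, and I want to minimize lossFunction() by first optimizing one mxk matrix while keeping the other constant, and then in the subsequent step optimizing the second matrix while keeping the first constant. This can be thought of as a special case of the expectation-maximization problem, but unfortunately I have not yet fully understood what the documentation is trying to say, so I thought I would turn to SO.

Thanks in advance!

This is part of a class assignment, so please don't post code! Any guidance or explanation would be greatly appreciated.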

score 0 · Accepted Answer

Use scipy.optimize.minimize twice, with two different objective functions.

First, run the optimization with an objective function that takes a as its argument and returns the objective value.
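The answer does not say how the 0/1 structure of a should be handled, and an unconstrained minimize over a would be unbounded. One possible reading, sketched here under assumptions, relaxes a to values in [0, 1] with each row summing to 1. Because the loss is linear in a, this sketch swaps in scipy.optimize.linprog instead of minimize; the toy data X, the fixed centroids mu, and all variable names are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (assumed for illustration): 4 points and 2 fixed centroids.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
mu = np.array([[0.0, 0.5], [5.5, 5.0]])
m, k = X.shape[0], mu.shape[0]

# d[i, j] = squared distance between point i and centroid j.
d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

# a is flattened row by row, so row i occupies entries i*k .. i*k + k - 1.
# Equality constraints: sum_j a[i, j] == 1 for every point i.
A_eq = np.kron(np.eye(m), np.ones((1, k)))
b_eq = np.ones(m)

# Minimize sum_ij a[i, j] * d[i, j] with 0 <= a[i, j] <= 1.
res = linprog(c=d.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
a = res.x.reshape(m, k).round()

# Closed-form a-step for comparison: one-hot of the nearest centroid.
a_argmin = np.zeros((m, k))
a_argmin[np.arange(m), d.argmin(axis=1)] = 1
```

When each point has a unique nearest centroid, the relaxed problem has a single optimum and it coincides with the plain argmin assignment, so in practice the argmin update is the cheaper way to perform this step.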

As the second step, run scipy.optimize.minimize a second time, on a second objective function that takes mu as its argument.

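When writing the objective functions, remember that Python has nested functions, which avoids the need to pass mu (in the first case) or a (in the second case) as additional arguments; although that can also be done with minimize(..., args=(mu,)) and minimize(..., args=(a,)). Note that args should be a tuple.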
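A minimal sketch of the two styles for the second case (optimizing mu with a held fixed). The data X, the assignment matrix a, and the helper names loss, optimize_mu_closure and optimize_mu_args are all hypothetical; minimize expects a 1-D parameter vector, so mu is flattened and reshaped inside the objective.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and a fixed 0/1 assignment matrix (both assumed for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
a = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
k = a.shape[1]

def loss(mu_flat, a, X):
    # minimize works on 1-D vectors, so restore the (k, n_features) shape.
    mu = mu_flat.reshape(-1, X.shape[1])
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return np.sum(a * d)

def optimize_mu_closure(a, X, mu0):
    # Nested function: a and X come from the enclosing scope.
    def objective(mu_flat):
        return loss(mu_flat, a, X)
    return minimize(objective, mu0.ravel()).x.reshape(mu0.shape)

def optimize_mu_args(a, X, mu0):
    # Same optimization, with a and X passed through the args tuple.
    return minimize(loss, mu0.ravel(), args=(a, X)).x.reshape(mu0.shape)

mu0 = np.zeros((k, X.shape[1]))
mu_closure = optimize_mu_closure(a, X, mu0)
mu_args = optimize_mu_args(a, X, mu0)
```

Both calls solve the same smooth problem, so they should land (up to solver tolerance) on the mean of the points assigned to each cluster.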

Repeat the two-step process in a for loop until the answer satisfies your convergence criterion.
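The loop could look roughly like the sketch below. It is one possible arrangement, not a definitive implementation: the a-step uses a plain argmin assignment rather than minimize (a substitution, since a is a 0/1 matrix), the mu-step uses minimize with a nested objective, and the convergence test (loss decreasing by less than tol) as well as the names sq_dists, a_step, mu_step and alternate are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def sq_dists(X, mu):
    # (m, k) matrix of squared distances between points and centroids.
    return ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

def a_step(X, mu):
    # Optimize a with mu fixed: one-hot of the nearest centroid.
    d = sq_dists(X, mu)
    a = np.zeros_like(d)
    a[np.arange(X.shape[0]), d.argmin(axis=1)] = 1
    return a

def mu_step(X, a, mu):
    # Optimize mu with a fixed, starting from the current centroids.
    def objective(mu_flat):
        return np.sum(a * sq_dists(X, mu_flat.reshape(mu.shape)))
    return minimize(objective, mu.ravel()).x.reshape(mu.shape)

def alternate(X, mu, max_iter=50, tol=1e-8):
    prev = np.inf
    for _ in range(max_iter):
        a = a_step(X, mu)
        mu = mu_step(X, a, mu)
        loss = np.sum(a * sq_dists(X, mu))
        # Assumed convergence criterion: the loss stopped decreasing.
        if prev - loss < tol:
            break
        prev = loss
    return a, mu, loss

# Toy data and starting centroids (assumed for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
mu_init = np.array([[0.0, 0.0], [6.0, 5.0]])
a, mu, loss = alternate(X, mu_init)
```

Starting the mu-step from the current centroids keeps each call to minimize cheap once the assignments have stopped changing.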
