python - 有能力的k-means聚类？

Question

我是算法和优化的新手。
我正在尝试实施capacitated k-means，但到目前为止尚未解决且结果不佳。
这被用作 CVRP 模拟（有能力的车辆路线问题）的一部分。
我很好奇我是否错误地解释了引用的算法。

参考：“用于容量聚类问题的改进 K-Means 算法”（Geetha、Poonthalir、Vanathi）

模拟的 CVRP 有 15 个客户，有 1 个仓库。
每个客户都有欧几里得坐标 (x,y) 和需求。
有3辆车，每辆车可容纳90人。

因此，capacitated k-means 试图将 15 个客户聚集到 3 辆车中，每个集群中的总需求不能超过车辆容量。

更新：

在引用的算法中，我无法获取有关代码在用完“下一个最近的质心”时必须做什么的任何信息。
也就是说，当检查了所有“最近的质心”时，在下面的步骤 14.b中，customers[1]仍然未分配。

这会导致未分配索引为 1 的客户。
注：customer[1]为需求量最大的客户（30）。
Q：当满足这个条件时，代码应该怎么做？

这是我对引用算法的解释，请更正我的代码，谢谢。

给定n请求者（客户）、n=customerCount和一个仓库
n 要求，
n 坐标 (x,y)
计算集群的数量，k=（所有需求的总和）/vehicleCapacity
选择初始质心，
5.a。根据demand, 降序排列客户 = d_customers,
5.b. 从中选择k第一批客户d_customers作为初始质心 = centroids[0 .. k-1],
创建二进制矩阵bin_matrix, 维度 = (customerCount) x (k),
6.a. bin_matrix用全零填充
开始 WHILE 循环，条件 = WHILE not converged。
7.a.converged = False
开始 FOR 循环，条件 = FOR each customers，
8.a。客户指数 = i
customers[i]计算从到所有的欧几里得距离centroids=> edist
9.a。按升序排序edist，
9.b。选择centroid最接近的距离=closest_centroid
启动 WHILE 循环，条件 =while customers[i]未分配给任何集群。
将所有其他未分配的客户分组 = G,
11.a. 考虑closest_centroid为的质心G。
计算Pi每个 12.acustomers的优先级G。
优先级Pi = (distance from customers[i] to closest_cent) / demand[i]
12.b。选择优先级最高的客户Pi。
12.c。具有最高优先级的客户的索引 = hpc
12.d。问：如果找不到最高优先级的客户，我们该怎么办？
如果可能，分配customers[hpc]给。 13.a. =的需求， 13.b. 质心成员的所有需求的总和 = , 13.c. .. 13.d。分配给 13.e. 更新，行索引= ，列索引= ，设置为。centroids[closest_centroid]
customers[hpc]d1
dtot
IF (d1 + dtot) <= vehicleCapacity, THEN
customers[hpc]centroids[closest_centroid]
bin_matrixhpcclosest_centroid1
如果customers[i]是（仍然）not assigned到任何集群，那么......
14.a。选择next nearest centroid, 与的下一个最近距离edist。
14.b。问：如果没有下一个最近的质心，那我们该怎么办？
通过比较先前的矩阵和更新的矩阵 bin_matrix 计算收敛。
15.a. 如果没有变化bin_matrix，则设置converged = True。
否则，new centroids从更新的集群计算。
16.a. centroids' coordinates根据每个集群的成员计算新的。
16.b。sum_x=所有x-coordinate集群的总和members，
16.c。num_c= 总数customers (members)集群中所有人的数量，
16.d。x-coordinate集群的新质心= sum_x / num_c。
16.e. 使用相同的公式，计算y-coordinate集群的新质心 = sum_y / num_c。
迭代主 WHILE 循环。

我的代码总是在步骤 14.b以未分配的客户结尾。
那是当customers[i]仍然没有分配给任何质心时，它已经用完了“下一个最近的质心”。

并且由此产生的集群很差。输出图：

- 图中，星形为质心，方形为仓库。
在图中，客户标记为“1”，需求=30 始终以没有分配的集群结束。

程序的输出，

k_cluster 3
idx [ 1 -1  1  0  2  0  1  1  2  2  2  0  0  2  0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]

第一个和第三个集群计算不佳。未分配
idx索引为“ ”的（1-1 ”的 ( )

问：我的解释和实施有什么问题？
任何更正，建议，帮助，将不胜感激，在此先感谢您。

这是我的完整代码：

#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png

import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab

# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7], \
             [2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16], \
             [4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21], \
             [6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19], \
             [8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11], \
             [10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19], \
             [12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23], \
             [14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount

# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinate of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinate of members of each cluster
xy_members = []

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
    global k_cluster, bin_matrix, centroids, tot_demand
    global members, xy_members, prev_members

    # calculate number of clusters
    tot_demand = sum([c[3] for c in customers])
    k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
    print 'k_cluster', k_cluster

    # initial centroids = first sorted-customers based on demand
    d_customers = sorted(customers, key=itemgetter(3), reverse=True)
    centroids, tot_demand, members, xy_members = [], [], [], []
    for i in range(k_cluster):
        centroids.append(d_customers[i][1:3])   # [x,y]

        # initial total demand and members for each cluster
        tot_demand.append(0)
        members.append([])
        xy_members.append([])

    # binary matrix, dimension = customerCount-1 x k_cluster
    bin_matrix = [[0] * k_cluster for i in range(len(customers))]

    converged = False
    while not converged:  # until no changes in formed-clusters
        prev_matrix = deepcopy(bin_matrix)

        for i in range(len(customers)):
            edist = []  # list of distance to clusters

            if assigned[i] == -1:  # if not assigned yet
                # Calculate the Euclidean distance to each of k-clusters
                for k in range(k_cluster):
                    p1 = (customers[i][1], customers[i][2]) # x,y
                    p2 = (centroids[k][0], centroids[k][1])
                    edist.append((distance(p1, p2), k))

                # sort, based on closest distance
                edist = sorted(edist, key=itemgetter(0))

            closest_centroid = 0    # first index of edist
            # loop while customer[i] is not assigned
            while assigned[i] == -1:  
                # calculate all unsigned customers (G)'s priority
                max_prior = (0, -1)   # value, index
                for n in range(len(customers)):
                    pc = customers[n]

                    if assigned[n] == -1:   # if unassigned
                        # get index of current centroid
                        c = edist[closest_centroid][1]
                        cen = centroids[c]     # x,y

                        # distance_cost / demand
                        p = distance((pc[1], pc[2]), cen) / pc[3]

                        # find highest priority
                        if p > max_prior[0]:
                            max_prior = (p, n)  # priority,customer-index

                # if highest-priority is not found, what should we do ???
                if max_prior[1] == -1:   
                    break

                # try to assign current cluster to highest-priority customer
                hpc = max_prior[1]    # index of highest-priority customer
                c = edist[closest_centroid][1]   # index of current cluster

                # constraint, total demand in a cluster <= capacity
                if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
                    # assign new member of cluster
                    members[c].append(hpc)   # add index of customer

                    xy = (customers[hpc][1], customers[hpc][2])  # x,y
                    xy_members[c].append(xy)

                    tot_demand[c] += customers[hpc][3]
                    assigned[hpc] = c   # update cluster to assigned-customer

                    # update binary matrix
                    bin_matrix[hpc][c] = 1

                # if customer is not assigned then,
                if assigned[i] == -1:
                    if closest_centroid < len(edist)-1:
                        # choose the next nearest centroid
                        closest_centroid += 1

                    # if run out of closest centroid, what must we do ???
                    else:
                        break   # exit without centroid ???

            # end while
        # end for

        # Calculate the new centroid from the formed clusters
        for j in range(k_cluster):
            xj = sum([cn[0] for cn in xy_members[j]])
            yj = sum([cn[1] for cn in xy_members[j]])
            xj = float(xj) / len(xy_members[j])
            yj = float(yj) / len(xy_members[j])
            centroids[j] = (xj, yj)

        # calculate converged
        converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
    # end while

def clustering():
    cap_k_means()

    # debug plot
    idx = numpy.array([c for c in assigned])
    xy = numpy.array([(c[1], c[2]) for c in customers])

    COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise", 
          "IndianRed", "MediumVioletRed", "Orange", "Purple"]

    for i in range(min(idx), max(idx)+1):
        clr = random.choice(COLORS)
        pylab.plot(xy[idx==i, 0], xy[idx==i, 1], color=clr, \
            linestyle='dashed', \
            marker='o', markerfacecolor=clr, markersize=8)
    pylab.plot(centroids[:][0], centroids[:][1], '*k', markersize=12)
    pylab.plot(depot[1], depot[2], 'sk', markersize=12)

    for i in range(len(idx)):
        pylab.annotate(str(i), xy[i])

    pylab.savefig('clust1.png')
    pylab.show()

    return idx

def main():
    idx = clustering()
    print 'idx', idx
    print 'centroids', centroids
    print 'members', members
    print 'demands', tot_demand

if __name__ == '__main__':
    main()

score 1 · Accepted Answer

当总需求接近总容量时，这个问题开始出现在装箱方面。正如您所发现的，这种特定算法的贪心方法并不总是成功的。我不知道作者是否承认这一点，但如果他们不承认，审稿人应该已经注意到了。

如果您想继续使用这种算法，我会尝试使用整数编程将请求者分配给质心。

score 0 · Accepted Answer

你引用的论文没有详述所有细节

 if ri is not assigned then
     choose the next nearest centroid
 end if

在第 5 节末尾的算法中。

必须有一个下一个最近的质心 - 如果两个是等距的，我认为你选择哪个并不重要。

score 0 · Accepted Answer

固定大小聚类的一个常见问题是，您通常可以在输出中识别“交换”，其中在聚类之间交换 2 个点可以创建更好的解决方案。

我们可以通过将“找到将点分配给的集群”作为分配问题来改进参考论文中的约束 k-means 算法，而不是贪婪地选择最近的未满的。

解决此问题的常用方法是使用最小成本流算法。这样做的结果保证了没有任何“交换”可以改善结果。

幸运的是，有人已经实现了这个并为它创建了一个包：https ://github.com/joshlk/k-means-constrained

查看Bradley, PS, Bennett, KP 和 Demiriz, A. (2000)。受约束的 k 均值聚类。

附带说明一下，需求量很大的客户可能仍然无法分配到单个仓库，因此可能需要增加k的值 a 直到有可行的解决方案，或者“拆分”需求需要允许多个站点之间的客户。

python - 有能力的k-means聚类？

3 回答 3

Related

Reference