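I wrote my own clustering algorithm for my problem (bad, I know). It works fine, but it could be faster.
The algorithm takes a list of values (1D) as input and works as follows: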
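- For each cluster, calculate the distance to its nearest neighbouring cluster
- Select the cluster A with the smallest distance to its neighbour B
- If the distance between A and B is not below the threshold, stop
- Combine A and B
- Go back to the first step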
I may be reinventing the wheel here.
Here is my brute-force code. How can I make it faster? If there is something ready-made, I have SciPy and NumPy installed.
# Cluster center as the simple average of its values
def cluster_center(cluster):
    # float() avoids integer division under Python 2
    return float(sum(cluster)) / len(cluster)

# Distance between two clusters is the distance between their centers
def cluster_distance(a, b):
    return abs(cluster_center(a) - cluster_center(b))

# `clusters` is assumed to exist already as a list of lists of values
while True:
    cluster_distances = []
    # Fewer than two clusters: nothing left to merge
    if len(clusters) < 2:
        break
    # Go through all clusters, find the shortest distance to another cluster
    for cluster in clusters:
        # Compare by identity so clusters holding equal values are not skipped
        nearest = min(
            ((cluster_distance(cluster, c), c) for c in clusters if c is not cluster),
            key=lambda pair: pair[0])
        cluster_distances.append((cluster, nearest))
    # Sort so that the closest pair comes first
    cluster_distances.sort(key=lambda item: item[1][0])
    # Check if the smallest distance is under the threshold of 15
    if cluster_distances[0][1][0] < 15:
        a = cluster_distances[0][0]
        b = cluster_distances[0][1][1]
        # Combine clusters (concatenate the lists)
        a.extend(b)
        # Form a new cluster list without b
        clusters = [c[0] for c in cluster_distances if c[0] is not b]
    else:
        break
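
One possible direction for a speed-up, offered as a sketch rather than a drop-in replacement: for 1D values, the cluster nearest to any given cluster is always one of its neighbours once the clusters are ordered by center, and merging two neighbours keeps that order intact. So it should be enough to sort once and compare adjacent centers only. The function name `cluster_1d`, the `threshold` parameter and the `sums` helper list are hypothetical names introduced for this illustration, and ties may be broken differently than in the loop above.

```python
def cluster_1d(values, threshold=15):
    """Greedy 1D merging of adjacent clusters (illustrative sketch).

    Repeatedly merges the two neighbouring clusters whose centers are
    closest, and stops once the smallest gap is not below `threshold`.
    """
    ordered = sorted(values)
    # Start with one cluster per value, kept in sorted order
    clusters = [[v] for v in ordered]
    # Running sum per cluster, so centers are cheap to recompute
    sums = [float(v) for v in ordered]
    while len(clusters) > 1:
        centers = [s / len(c) for s, c in zip(sums, clusters)]
        # Only gaps between neighbouring centers need to be checked
        gaps = [centers[i + 1] - centers[i] for i in range(len(centers) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] >= threshold:
            break
        # Merge cluster i+1 into cluster i; the merged center lies between
        # the two old centers, so the list stays ordered by center
        clusters[i].extend(clusters.pop(i + 1))
        sums[i] += sums.pop(i + 1)
    return clusters


print(cluster_1d([52, 1, 100, 3, 2, 50]))  # [[1, 2, 3], [50, 52], [100]]
```

Each pass here is linear in the number of clusters, whereas the loop above compares every cluster against every other cluster on every pass.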