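After learning so much from Stack Overflow, I finally have a chance to give back! An approach different from those offered so far is to relabel the clusters to maximize alignment, after which the comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be treated as equivalent, since L1 and L2 partition the items into the same clusters. This approach relabels L2 to maximize alignment with L1, and in the example above it results in L2==L1.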
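To make the motivation concrete, here is a minimal sketch (plain Python, no numpy) of why raw labels cannot be compared directly. The helper names `naive_agreement` and `as_partition` are made up for this illustration and are not part of the method described here.

```python
from collections import defaultdict

def naive_agreement(a, b):
    """Fraction of positions where the raw label values match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def as_partition(labels):
    """Represent a labeling as a set of clusters, each a frozenset of item indices."""
    groups = defaultdict(set)
    for idx, lab in enumerate(labels):
        groups[lab].add(idx)
    return {frozenset(g) for g in groups.values()}

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]

print(naive_agreement(L1, L2))               # 0.0, the label values never match
print(as_partition(L1) == as_partition(L2))  # True, the clusters are identical
```

The two labelings disagree at every position, yet they describe exactly the same grouping, which is why relabeling is needed before any element-wise comparison.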
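I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012". Below is an implementation in Python using numpy. I am fairly new to Python, so there may be better implementations, but I think this gets the job done: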

    import numpy as np

    def alignClusters(clstr1, clstr2):
        """Given 2 cluster assignments, this function will rename the second to
        maximize alignment of elements within each cluster. This method is
        described in Menéndez, Héctor D. A genetic approach to the graph and
        spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
        are consecutive integers starting with zero)

        INPUTS:
        clstr1 - The first clustering assignment
        clstr2 - The second clustering assignment

        OUTPUTS:
        clstr2_temp - The second clustering assignment with clusters renumbered to
        maximize alignment with the first clustering assignment """
        # Accept lists as well as arrays; element-wise == needs ndarrays.
        clstr1 = np.asarray(clstr1)
        clstr2 = np.asarray(clstr2)
        # Cover the labels of both assignments, not just the first.
        K = int(max(clstr1.max(), clstr2.max())) + 1

        # simdist[i,j]: overlap between cluster i of clstr1 and cluster j of
        # clstr2, averaged as a fraction of each cluster's size.
        simdist = np.zeros((K, K))
        for i in range(K):
            dcix = clstr1 == i
            for j in range(K):
                dcjx = clstr2 == j
                dd = float(np.sum(dcix & dcjx))
                if dd > 0:  # also avoids dividing by the size of an empty cluster
                    simdist[i, j] = (dd / dcix.sum() + dd / dcjx.sum()) / 2

        # Greedy matching: repeatedly take the most similar remaining pair and
        # retire its row and column. Retired cells are set to -1 rather than 0 so
        # they cannot be picked again when the remaining similarities are all 0.
        clstr2_temp = np.copy(clstr2)
        for _ in range(K):
            # argmax over the transpose keeps the original column-major tie-breaking
            y, x = np.unravel_index(np.argmax(simdist.T), simdist.T.shape)
            clstr2_temp[clstr2 == y] = x  # rename cluster y of clstr2 to x
            simdist[x, :] = -1
            simdist[:, y] = -1
        return clstr2_temp
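
The function above picks matches greedily, one best pair at a time, and a greedy matching is not guaranteed to give the highest possible element-wise agreement. For small numbers of clusters, one way to sanity-check its output is to try every relabeling. The sketch below is not from the cited thesis; `best_relabeling` is a hypothetical helper written in plain Python, it assumes labels are the integers 0..K-1, and its cost grows as K!, so it is only practical for a handful of clusters.

```python
from itertools import permutations

def best_relabeling(labels1, labels2):
    """Try every renaming of labels2 and keep the one that matches labels1
    at the most positions. Assumes labels are integers 0..K-1."""
    k = max(max(labels1), max(labels2)) + 1
    best_score, best_labels = -1, None
    for perm in permutations(range(k)):
        # perm[old_label] is the new name given to that cluster of labels2
        relabeled = [perm[lab] for lab in labels2]
        score = sum(a == b for a, b in zip(labels1, relabeled))
        if score > best_score:
            best_score, best_labels = score, relabeled
    return best_labels, best_score

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]
relabeled, score = best_relabeling(L1, L2)
print(relabeled, score)  # [0, 0, 1, 1, 2, 2] 6
```

If the relabeling returned by alignClusters agrees with L1 at fewer positions than this exhaustive search, the greedy step chose a suboptimal pairing for that input.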