mahout - Mahout KMeans produces twice as many clusters as my initial K setting

Question

I am a beginner with Mahout. I am using Mahout 0.8 and following the tutorial at https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html

When I run: mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata -o output -t1 20 -t2 50 -k 5 -x 20 -ow

and then use clusterdump to extract the cluster centers:

mahout clusterdump --input output/clusters-20-final --output /media/synthetic_control.center

the synthetic_control.center file contains:

VL-585{n=50 c=[29.832, 29.589, 29.405, 28.516, 29.600, ….] r=[3.152, 3.518, 3.292, …]}

VL-591{n=197 c=[29.984, 29.681,…] r=[3.602, 3.558, 3.364,…]}

VL-595{n=203 c=[….] r=[….]}

VL-597{n=61 c=[….] r=[….]}

VL-599{n=43 c=[….] r=[….]}

VL-585{n=1 c=[….] r=[….]}

VL-591{n=27 c=[….] r=[….]}

VL-595{n=1 c=[….] r=[….]}

VL-597{n=1 c=[….] r=[….]}

VL-599{n=16 c=[….] r=[….]}

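It looks like k-means generated 10 clusters, but I set k to 5.

I also tried other values of k, and it always produces twice as many clusters.

Can anyone help me with this? Thanks a lot!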
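One thing worth checking in the dump above is whether the 10 entries are really 10 different clusters. A minimal sketch, assuming each clusterdump line starts with an identifier followed by `{n=` as in the output shown (the class name `ClusterDumpCount` and the shortened `c=[...]` / `r=[...]` values are made up for illustration):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterDumpCount {
    // Matches the start of a line such as "VL-585{n=50 c=[...] r=[...]}"
    // and captures the cluster id ("VL-585").
    private static final Pattern LINE = Pattern.compile("^(\\w+-\\d+)\\{n=\\d+");

    /** Returns how many times each cluster id appears in the dump lines. */
    public static Map<String, Integer> occurrences(List<String> lines) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.find()) {
                seen.merge(m.group(1), 1, Integer::sum);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Ids and point counts copied from the dump; vectors shortened.
        List<String> dump = Arrays.asList(
            "VL-585{n=50 c=[...] r=[...]}",
            "VL-591{n=197 c=[...] r=[...]}",
            "VL-595{n=203 c=[...] r=[...]}",
            "VL-597{n=61 c=[...] r=[...]}",
            "VL-599{n=43 c=[...] r=[...]}",
            "VL-585{n=1 c=[...] r=[...]}",
            "VL-591{n=27 c=[...] r=[...]}",
            "VL-595{n=1 c=[...] r=[...]}",
            "VL-597{n=1 c=[...] r=[...]}",
            "VL-599{n=16 c=[...] r=[...]}");

        Map<String, Integer> seen = occurrences(dump);
        int total = 0;
        for (int count : seen.values()) {
            total += count;
        }
        // Prints: 10 entries, 5 distinct ids
        System.out.println(total + " entries, " + seen.size() + " distinct ids");
    }
}
```

So the dump holds only 5 distinct ids, each listed twice, which suggests the same 5 clusters are being read from two places rather than 10 clusters being computed.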

score 1 · Accepted Answer

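Ha! After reading the code, I finally found the bug in mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job!

Here is what happens: in syntheticcontrol.kmeans.Job, if the user sets k, the job does not run canopy clustering before k-means; it runs k-means directly. K-means needs an initial center for each cluster, so the job uses RandomSeedGenerator to generate the centers at random and writes that file (part-randomSeed) into the output/clusters-0 folder. K-means then uses those centers to assign all the points, updates the cluster centers, and writes the updated centers into output/clusters-0 as well. So the clusters-0 folder ends up holding two sets of centers, and the first iteration reads twice as many clusters. That is why this job always produces double the number of clusters.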
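The effect described above can be reproduced without Mahout or Hadoop. The following is only a toy sketch using plain java.nio: it assumes the next step reads every file in the clusters-0 folder, and the file name part-r-00000, the class name `SeedDirDemo`, and the placeholder "center-i" lines are made up for illustration (only part-randomSeed and clusters-0 come from the explanation).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedDirDemo {
    /** Writes k placeholder centers, one per line, into dir/fileName. */
    static void writeCenters(Path dir, String fileName, int k) throws IOException {
        Files.createDirectories(dir);
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            lines.add("center-" + i);
        }
        Files.write(dir.resolve(fileName), lines);
    }

    /** Counts the centers in every file of dir, like a reader that loads the whole folder. */
    static int countCenters(Path dir) throws IOException {
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                count += Files.readAllLines(file).size();
            }
        }
        return count;
    }

    /**
     * Writes k random seeds into seedDirName and k updated centers into
     * clusters-0, then returns how many centers a reader of clusters-0 sees.
     */
    public static int centersSeenByNextIteration(int k, String seedDirName) {
        try {
            Path output = Files.createTempDirectory("kmeans-demo");
            writeCenters(output.resolve(seedDirName), "part-randomSeed", k);
            // Hypothetical name for the file holding the updated centers.
            writeCenters(output.resolve("clusters-0"), "part-r-00000", k);
            return countCenters(output.resolve("clusters-0"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Seeds share the folder with the updated centers: prints 10
        System.out.println(centersSeenByNextIteration(5, "clusters-0"));
        // Seeds kept in their own folder: prints 5
        System.out.println(centersSeenByNextIteration(5, "randomSeeds"));
    }
}
```

With k = 5, sharing the folder gives 10 centers and a separate seed folder gives 5, which matches the doubling seen in the question.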

Solution: save part-randomSeed to a different folder. In org.apache.mahout.clustering.syntheticcontrol.kmeans.Job,

at line 142, Path clusters = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);

change it to Path clusters = new Path(output, "randomSeeds");

