
I am trying to run k-means using Hadoop. I want to save the centroids of the clusters calculated in the cleanup method of the Reducer to a file, say centroids.txt. Now, I would like to know what happens if the cleanup methods of multiple reducers start at the same time and all of them try to write to this file simultaneously. Will this be handled internally? If not, is there a way to synchronize this task?

Note that this is not my reducer's output file; it is an additional file that I am maintaining to keep track of the centroids. I am using a BufferedWriter in the reducer's cleanup method to do this.
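
For concreteness, here is a minimal sketch of the setup described above; KMeansReducer, the centroids field, and the /kmeans/centroids.txt path are all hypothetical names, not taken from the question:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansReducer extends Reducer<Text, Text, Text, Text> {

        // Centroids accumulated by reduce() (reduce() itself omitted here).
        private final List<String> centroids = new ArrayList<String>();

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Every reducer opens this same fixed path -- these are the
            // simultaneous writes the question is asking about.
            Path out = new Path("/kmeans/centroids.txt");
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedWriter writer =
                     new BufferedWriter(new OutputStreamWriter(fs.create(out, true)))) {
                for (String centroid : centroids) {
                    writer.write(centroid);
                    writer.newLine();
                }
            }
        }
    }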


3 Answers


Yes, you are right. You cannot achieve this with the existing framework: cleanup will be called multiple times (once per reducer), and you cannot synchronize those calls. Possible approaches you can follow are:

  1. Call a merge after the job succeeds:

         hadoop fs -getmerge <src> <localdst> [addnl]

     See the Hadoop FileSystem shell documentation for details.

  2. Explicitly specify the location of the output files, and use that folder as the input of the next job.

  3. Chain one more MR job whose map and reduce do not change the data and whose partitioner sends all of the data to a single reducer, so the centroids end up in one file (a sketch follows this list).
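
Here is a minimal sketch of option 3, assuming the per-reducer centroid files contain tab-separated key/value text; MergeCentroidsJob and the paths are hypothetical. The stock Mapper and Reducer classes pass records through unchanged, and a single reduce task forces every record into one output file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeCentroidsJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "merge-centroids");
            job.setJarByClass(MergeCentroidsJob.class);

            // Identity mapper and reducer: records pass through unchanged.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);

            // With a single reduce task, every record lands in one reducer,
            // so the job emits exactly one output file (part-r-00000).
            job.setNumReduceTasks(1);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // Hypothetical paths: the centroid files written by the k-means job.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With one reduce task the default partitioner trivially sends everything to the same reducer, so no custom partitioner is needed.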

answered 2014-04-26T09:26:37.157

Since the centroids are relatively few, you can write them to ZooKeeper. If your read/write load were high, you would probably need HBase (you could also use it here, but that would be overkill).
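
A minimal sketch of the ZooKeeper idea using the plain ZooKeeper client; the connection string and znode paths are assumptions. Sequential znodes give every write a unique name, so concurrent reducers cannot collide:

    import java.nio.charset.StandardCharsets;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CentroidStore {
        public static void main(String[] args) throws Exception {
            // Connection string, session timeout, and znode paths are
            // assumptions for this sketch; the parent znodes must exist.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

            String[] centroids = { "1.0,2.0", "3.5,4.2" }; // example data

            for (String centroid : centroids) {
                // PERSISTENT_SEQUENTIAL appends a unique counter to the
                // name, so writes from concurrent reducers never collide.
                zk.create("/kmeans/centroids/c-",
                          centroid.getBytes(StandardCharsets.UTF_8),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT_SEQUENTIAL);
            }
            zk.close();
        }
    }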

Also note that there are several k-means implementations on Hadoop, such as Mahout. And some implementations are more efficient than map/reduce, for example Apache Hama, which uses BSP, or Spark, which runs in memory.

answered 2014-04-25T18:27:59.677

Have each reducer write to a separate file. Multiple reducers can never modify the same file.
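
In other words, a safe variant of the cleanup sketched under the question is to derive the file name from the task ID (the path prefix is again an assumption), so that every reducer gets its own file:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansReducer extends Reducer<Text, Text, Text, Text> {

        private final List<String> centroids = new ArrayList<String>();

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Key the file name off this task's ID so no two reducers
            // ever open the same path.
            String taskId = context.getTaskAttemptID().getTaskID().toString();
            Path out = new Path("/kmeans/centroids-" + taskId + ".txt");
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (BufferedWriter writer =
                     new BufferedWriter(new OutputStreamWriter(fs.create(out, true)))) {
                for (String centroid : centroids) {
                    writer.write(centroid);
                    writer.newLine();
                }
            }
        }
    }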

answered 2014-04-25T17:09:36.927