0

我必须实现 DBSCAN 算法。假设从这个伪代码开始

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood

我的代码必须在带有 Ubuntu Linux 64 位的Amazon EC2实例上运行。

函数regionQuery查询MongoDB数据库以获取 P 的 eps-neighborhood 内的所有点。

那么,根据您的说法,实现它以提高性能的最佳编程语言是什么?CPHPJava(我不认为)?

4

3 回答 3

4

我假设您有很多积分并且需要快速获得结果 - 否则您几乎可以使用任何东西。

对我来说,这似乎是 map-reduce 的工作

地图部分将是“针对每个未访问点”的循环,并且应该发出包含邻居、候选集群和其他任何内容的数据构造。如果点被归类为噪声,它不应该发出任何声音。

集群扩展将进入 reduce 并可能完成部分 - 语言选择也将是 javascript,一切都将在 mongo 内部发生

于 2012-05-27T15:58:25.140 回答
3

谷歌搜索“parallel DBSCAN”,你会发现很多文章讨论了如何并行化这个算法。通常,它会相当大地改变算法,例如它需要合并集群。

Canopy 预聚类也可能是 DBSCAN 的一个很好的预处理步骤。

于 2012-06-02T09:55:27.743 回答
1

我忘了回答我自己的问题。我终于实现了 DBSCAN 算法的 MapReduce 版本。你可以在这里找到它(Hadoop)。

这是它如何工作的伪代码:

function map(P, eps, MinPts)
    if P is unvisited then
        mark P as visited
        NeighborPts = regionQuery(P, eps)
        if sizeof(NeighborPts) < MinPts then
            do nothing
        else
            mark P as clusterized
            prepare the key
            create new cluster C
            C.neighborPoints = NeighborPts
            C.points = P
            emit(key, C)

function reduce(key, clusters, eps, MinPts)
    finalC is the final cluster
    for all C in clusters do
        finalC.points = finalC.points ∪ C.points
        for all P in C.neighborPoints do
            if P′ is not visited then
                mark P′ as visited
                NeighborPts′ = regionQuery(P′,eps)
                if sizeof(NeighborPts′) ≥ MinPts then
                    NeighborPts = NeighborPts ∪ NeighborPts′
                end if
            end if
            if P′ is not yet member of any cluster then
                add P′ to cluster C
            end if
于 2012-10-05T15:16:59.353 回答