algorithm - 如何找到一组数据点的中心？

Question

假设我在过去一年中每天都绘制了一架直升机的位置，并得出了以下地图：

任何看到这个的人都会告诉我这架直升机是在芝加哥以外的地方。

如何在代码中找到相同的结果？

我正在寻找这样的东西：

$geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]);
function findHome($geoCodeArray) {
    // magic
    return $geoCode;
}

最终产生这样的东西：

地图-首页

更新：示例数据集

这是带有示例数据集的地图：http: //batchgeo.com/map/c3676fe29985f00e1605cd4f86920179

这是 150 个地理编码的 pastebin：http: //pastebin.com/grVsbgL9

以上包含 150 个地理编码。前 50 个位于芝加哥附近的几个集群中。其余的分散在全国各地，包括纽约、洛杉矶和旧金山的一些小集群。

我有大约一百万（严重）这样的数据集，我需要遍历并确定最有可能的“家”。非常感谢您的帮助。

更新 2：飞机改用直升机

飞机概念引起了对实体机场的过多关注。坐标可以是世界上的任何地方，而不仅仅是机场。让我们假设它是一架不受物理、燃料或其他任何东西约束的超级直升机。它可以降落在它想要的地方。;)

score 15 · Accepted Answer

通过将纬度和经度转换为笛卡尔坐标，即使这些点分散在地球上，以下解决方案也有效。它执行一种 KDE（内核密度估计），但在第一遍中，内核的总和仅在数据点处进行评估。应选择适合该问题的内核。在下面的代码中，我可以开玩笑/自以为是地称之为 Trossian，即 d≤h 为 2-d²/h²，d>h 为 h²/d²（其中 d 是欧几里得距离，h 是“带宽” $global_kernel_radius） , 但它也可以是高斯 (e ^-d²/2h² )、Epanechnikov 内核 (1-d²/h² 对于 d<h, 否则为 0) 或其他内核。可选的第二遍通过对局部网格上的独立内核求和，或通过计算质心，在局部细化搜索，$local_grid_radius.

本质上，每个点将它周围的所有点（包括它自己）相加，如果它们更接近（通过钟形曲线），则对它们进行加权，并且还通过可选的权重数组对它们进行加权$w_arr。获胜者是总和最大的点。一旦找到了获胜者，我们正在寻找的“家”可以通过在获胜者周围局部重复相同的过程（使用另一个钟形曲线）来找到，或者可以估计为所有点的“质心”在获胜者的给定半径内，其中半径可以为零。

该算法必须通过选择适当的内核、选择如何在本地细化搜索以及调整参数来适应问题。对于示例数据集，第一遍的 Trossian 内核和第二遍的 Epanechnikov 内核，所有 3 个半径都设置为 30 mi，网格步长为 1 mi 可能是一个很好的起点，但前提是两个子芝加哥的集群应该被视为一个大集群。否则必须选择较小的半径。

function find_home($lat_arr, $lng_arr, $global_kernel_radius,
                                       $local_kernel_radius,
                                       $local_grid_radius, // 0 for no 2nd pass
                                       $local_grid_step,   // 0 for centroid
                                       $units='mi',
                                       $w_arr=null)
{
   // for lat,lng <-> x,y,z see http://en.wikipedia.org/wiki/Geodetic_datum
   // for K and h see http://en.wikipedia.org/wiki/Kernel_density_estimation

   switch (strtolower($units)) {
      /*  */case 'nm' :
      /*or*/case 'nmi': $m_divisor = 1852;
      break;case  'mi': $m_divisor = 1609.344;
      break;case  'km': $m_divisor = 1000;
      break;case   'm': $m_divisor = 1;
      break;default: return false;
   }
   $a  = 6378137 / $m_divisor; // Earth semi-major axis      (WGS84)
   $e2 = 6.69437999014E-3;     // First eccentricity squared (WGS84)

   $lat_lng_count = count($lat_arr);
   if ( !$w_arr) {
      $w_arr = array_fill(0, $lat_lng_count, 1.0);
   }
   $x_arr = array();
   $y_arr = array();
   $z_arr = array();
   $rad = M_PI / 180;
   $one_e2 = 1 - $e2;
   for ($i = 0; $i < $lat_lng_count; $i++) {
      $lat = $lat_arr[$i];
      $lng = $lng_arr[$i];
      $sin_lat = sin($lat * $rad);
      $sin_lng = sin($lng * $rad);
      $cos_lat = cos($lat * $rad);
      $cos_lng = cos($lng * $rad);
      // height = 0 (!)
      $N = $a / sqrt(1 - $e2 * $sin_lat * $sin_lat);
      $x_arr[$i] = $N * $cos_lat * $cos_lng;
      $y_arr[$i] = $N * $cos_lat * $sin_lng;
      $z_arr[$i] = $N * $one_e2  * $sin_lat;
   }
   $h = $global_kernel_radius;
   $h2 = $h * $h;
   $max_K_sum     = -1;
   $max_K_sum_idx = -1;
   for ($i = 0; $i < $lat_lng_count; $i++) {
      $xi = $x_arr[$i];
      $yi = $y_arr[$i];
      $zi = $z_arr[$i];
      $K_sum  = 0;
      for ($j = 0; $j < $lat_lng_count; $j++) {
         $dx = $xi - $x_arr[$j];
         $dy = $yi - $y_arr[$j];
         $dz = $zi - $z_arr[$j];
         $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
         $K_sum += $w_arr[$j] * ($d2 <= $h2 ? (2 - $d2 / $h2) : $h2 / $d2); // Trossian ;-)
         // $K_sum += $w_arr[$j] * exp(-0.5 * $d2 / $h2); // Gaussian
      }
      if ($max_K_sum < $K_sum) {
          $max_K_sum = $K_sum;
          $max_K_sum_i = $i;
      }
   }
   $winner_x   = $x_arr  [$max_K_sum_i];
   $winner_y   = $y_arr  [$max_K_sum_i];
   $winner_z   = $z_arr  [$max_K_sum_i];
   $winner_lat = $lat_arr[$max_K_sum_i];
   $winner_lng = $lng_arr[$max_K_sum_i];

   $sin_winner_lat = sin($winner_lat * $rad);
   $cos_winner_lat = cos($winner_lat * $rad);
   $sin_winner_lng = sin($winner_lng * $rad);
   $cos_winner_lng = cos($winner_lng * $rad);
   $east_x  = -$local_grid_step * $sin_winner_lng;
   $east_y  =  $local_grid_step * $cos_winner_lng;
   $east_z  =  0;
   $north_x = -$local_grid_step * $sin_winner_lat * $cos_winner_lng;
   $north_y = -$local_grid_step * $sin_winner_lat * $sin_winner_lng;
   $north_z =  $local_grid_step * $cos_winner_lat;

   if ($local_grid_radius > 0 && $local_grid_step > 0) {
      $r = intval($local_grid_radius / $local_grid_step);
      $r2 = $r * $r;
      $h = $local_kernel_radius;
      $h2 = $h * $h;
      $max_L_sum     = -1;
      $max_L_sum_idx = -1;
      for ($i = -$r; $i <= $r; $i++) {
         $winner_east_x = $winner_x + $i * $east_x;
         $winner_east_y = $winner_y + $i * $east_y;
         $winner_east_z = $winner_z + $i * $east_z;
         $j_max = intval(sqrt($r2 - $i * $i));
         for ($j = -$j_max; $j <= $j_max; $j++) {
            $x = $winner_east_x + $j * $north_x;
            $y = $winner_east_y + $j * $north_y;
            $z = $winner_east_z + $j * $north_z;
            $L_sum  = 0;
            for ($k = 0; $k < $lat_lng_count; $k++) {
               $dx = $x - $x_arr[$k];
               $dy = $y - $y_arr[$k];
               $dz = $z - $z_arr[$k];
               $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
               if ($d2 < $h2) {
                  $L_sum += $w_arr[$k] * ($h2 - $d2); // Epanechnikov
               }
            }
            if ($max_L_sum < $L_sum) {
                $max_L_sum = $L_sum;
                $max_L_sum_i = $i;
                $max_L_sum_j = $j;
            }
         }
      }
      $x = $winner_x + $max_L_sum_i * $east_x + $max_L_sum_j * $north_x;
      $y = $winner_y + $max_L_sum_i * $east_y + $max_L_sum_j * $north_y;
      $z = $winner_z + $max_L_sum_i * $east_z + $max_L_sum_j * $north_z;

   } else if ($local_grid_radius > 0) {
      $r = $local_grid_radius;
      $r2 = $r * $r;
      $wx_sum = 0;
      $wy_sum = 0;
      $wz_sum = 0;
      $w_sum  = 0;
      for ($k = 0; $k < $lat_lng_count; $k++) {
         $xk = $x_arr[$k];
         $yk = $y_arr[$k];
         $zk = $z_arr[$k];
         $dx = $winner_x - $xk;
         $dy = $winner_y - $yk;
         $dz = $winner_z - $zk;
         $d2 = $dx * $dx + $dy * $dy + $dz * $dz;
         if ($d2 <= $r2) {
            $wk = $w_arr[$k];
            $wx_sum += $wk * $xk;
            $wy_sum += $wk * $yk;
            $wz_sum += $wk * $zk;
            $w_sum  += $wk;
         }
      }
      $x = $wx_sum / $w_sum;
      $y = $wy_sum / $w_sum;
      $z = $wz_sum / $w_sum;
      $max_L_sum_i = false;
      $max_L_sum_j = false;

   } else {
      return array($winner_lat, $winner_lng, $max_K_sum_i, false, false);
   }

   $deg = 180 / M_PI;
   $a2 = $a * $a;
   $e4 = $e2 * $e2;
   $p = sqrt($x * $x + $y * $y);
   $zeta = (1 - $e2) * $z * $z / $a2;
   $rho  = ($p * $p / $a2 + $zeta - $e4) / 6;
   $rho3 = $rho * $rho * $rho;
   $s = $e4 * $zeta * $p * $p / (4 * $a2);
   $t = pow($s + $rho3 + sqrt($s * ($s + 2 * $rho3)), 1 / 3);
   $u = $rho + $t + $rho * $rho / $t;
   $v = sqrt($u * $u + $e4 * $zeta);
   $w = $e2 * ($u + $v - $zeta) / (2 * $v);
   $k = 1 + $e2 * (sqrt($u + $v + $w * $w) + $w) / ($u + $v);
   $lat = atan($k * $z / $p) * $deg;
   $lng = atan2($y, $x) * $deg;

   return array($lat, $lng, $max_K_sum_i, $max_L_sum_i, $max_L_sum_j);
}

距离是欧几里得而不是大圆这一事实对手头的任务的影响应该可以忽略不计。计算大圆距离会更加麻烦，并且只会导致非常远的点的权重显着降低 - 但这些点的权重已经非常低。原则上，不同的内核可以达到相同的效果。超出一定距离的完全截止的内核，如 Epanechnikov 内核，根本不存在这个问题（在实践中）。

WGS84 基准的 lat,lng 和 x,y,z 之间的转换是准确给出的（尽管不保证数值稳定性），更多的是作为参考而不是真正需要。如果要考虑高度，或者需要更快的反向转换，请参考维基百科文章。

Epanechnikov 核除了比 Gaussian 和 Trossian 核“更局部”外，还具有第二个循环最快的优点，即 O(ng)，其中 g 是局部网格的点数，并且可以如果 n 很大，也可以在第一个循环中使用，即 O(n²)。

score 10 · Accepted Answer

这可以通过寻找危险面来解决。参见罗斯莫公式。

这就是捕食者问题。给定一组地理定位的尸体，捕食者的巢穴在哪里？罗斯莫公式解决了这个问题。

score 7 · Accepted Answer

Find the point with the largest density estimate.

Should be pretty much straightforward. Use a kernel radius that roughly covers a large airport in diameter. A 2D Gaussian or Epanechnikov kernel should be fine.

http://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation

This is similar to computing a Heap Map: http://en.wikipedia.org/wiki/Heat_map and then finding the brightest spot there. Except it computes the brightness right away.

For fun I read a 1% sample of the Geocoordinates of DBpedia (i.e. Wikipedia) into ELKI, projected it into 3D space and enabled the density estimation overlay (hidden in the visualizers scatterplot menu). You can see there is a hotspot on Europe, and to a lesser extend in the US. The hotspot in Europe is Poland I believe. Last I checked, someone apparently had created a Wikipedia article with Geocoordinates for pretty much any town in Poland. The ELKI visualizer, unfortunately, neither allows you to zoom in, rotate, or reduce the kernel bandwidth to visually find the most dense point. But it's straightforward to implement yourself; you probably also don't need to go into 3D space, but can just use latitudes and longitudes.

Density of DBpedia Geo data.

Kernel Density Estimation should be available in tons of applications. The one in R is probably much more powerful. I just recently discovered this heatmap in ELKI, so I knew how to quickly access it. See e.g. http://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html for a related R function.

On your data, in R, try for example:

library(kernSmooth)
smoothScatter(data, nbin=512, bandwidth=c(.25,.25))

this should show a strong preference for Chicago.

library(kernSmooth)
dens=bkde2D(data, gridsize=c(512, 512), bandwidth=c(.25,.25))
contour(dens$x1, dens$x2, dens$fhat)
maxpos = which(dens$fhat == max(dens$fhat), arr.ind=TRUE)
c(dens$x1[maxpos[1]], dens$x2[maxpos[2]])

yields [1] 42.14697 -88.09508, which is less than 10 miles from Chicago airport.

To get better coordinates try:

rerunning on a 20x20 miles area around the estimated coordinates
a non-binned KDE in that area
better bandwidth selection with dpik
higher grid resolution

score 5 · Accepted Answer

在天体物理学中，我们使用所谓的“半质量半径”。给定一个分布及其中心，半质量半径是包含一半分布点的圆的最小半径。

这个量是点分布的特征长度。

如果您希望直升机的家是点最大集中的地方，那么它就是具有最小半质量半径的点！

我的算法如下：对于每个点，你计算这个半质量半径，以当前点的分布为中心。直升机的“家”将是具有最小半质量半径的点。

我已经实现了它，计算中心是42.149994 -88.133698（在芝加哥）我还使用了总质量的 0.2 而不是天体物理学中通常使用的 0.5（一半）。

这是我（在 python 中）找到直升机所在地的算法：

import math

import numpy

def inside(points,center,radius):
     ids=(((points[:,0]-center[0])**2.+(points[:,1]-center[1])**2.)<=radius**2.)
     return points[ids]

points = numpy.loadtxt(open('points.txt'),comments='#')

npoints=len(points)
deltar=0.1

idcenter=None
halfrmin=None

for i in xrange(0,npoints):
    center=points[i]
    radius=0.

    stayHere=True
    while stayHere:
         radius=radius+deltar
         ninside=len(inside(points,center,radius))
         #print 'point',i,'r',radius,'in',ninside,'center',center
         if(ninside>=npoints*0.2):
              if(halfrmin==None or radius<halfrmin):
                   halfrmin=radius
                   idcenter=i
                   print 'point',i,halfrmin,idcenter,points[idcenter]
              stayHere=False

#print halfrmin,idcenter
print points[idcenter]

score 4 · Accepted Answer

您可以使用DBSCAN来完成该任务。

DBSCAN 是具有噪声概念的基于密度的聚类。你需要两个参数：

首先是一个簇至少应该有的点数"minpoints"。第二个称为邻域参数"epsilon"，它为应该包含在集群中的周围点设置距离阈值。

整个算法是这样工作的：

从集合中尚未访问的任意点开始
从 epsilon 邻域中检索点，标记全部为已访问
1. 如果您在该邻域中找到了足够多的点（> minpoints 参数），您将启动一个新集群并分配这些点。现在针对该集群中的每个点再次递归到第 2 步。
2. 如果没有，请将此点声明为噪声
重新来过，直到你访问了所有的点

它实现起来非常简单，并且已经有很多框架支持这种算法。要找到集群的平均值，您可以简单地从其邻域中获取所有分配点的平均值。

但是，与@TylerDurden 提出的方法不同，这需要参数化——因此您需要找到一些适合您的问题的手动调整参数。

在您的情况下，如果飞机可能停留在您在机场跟踪的 10% 的时间，您可以尝试将 minpoints 设置为总点数的 10%。密度参数 epsilon 取决于您的地理传感器的分辨率和您使用的距离度量——我建议地理数据使用半正弦距离。

score 3 · Accepted Answer

在此处输入图像描述

如何将地图分成许多区域，然后在平面最多的区域中找到平面中心。算法将是这样的

set Zones[40]
 foreach Planes in Planes
   Zones[GetZone(Plane.position)].Add(Plane)

设置MaxZone = Zones[0]
 foreach Zones in Zones
    if MaxZone.Length() < Zone.Length()
           MaxZone = 区域

在 MaxZone 中设置Center
 foreach平面
     中心.X += 平面.X
     中心.Y += 平面.Y
Center.X /= MaxZone.Length
Center.Y /= MaxZone.Length

score 3 · Accepted Answer

我在这台机器上只有一个旧的编译器，所以我制作了一个 ASCII 版本的编译器。它“绘制”（以 ASCII 格式）地图 - 点是点，X 是真实源的位置，G 是猜测源的位置。如果两者重叠，则仅显示 X。

示例（难度分别为 1.5 和 3）：在此处输入图像描述

这些点是通过选择一个随机点作为源，然后随机分布点来生成的，使它们更有可能更接近源。

DIFFICULTY是一个浮点常数，它调节初始点的生成——这些点离源更近的可能性有多大——如果它是 1 或更小，程序应该能够猜出确切的源，或者非常接近。在 2.5 时，它应该仍然相当不错。在 4+ 时，它会开始猜得更糟，但我认为它仍然比人类猜得好。

它可以通过对 X 和 Y 使用二分搜索来优化——这会使猜测变得更糟，但会快得多。或者从更大的块开始，然后进一步分割最好的块（或最好的块和它周围的 8 个块）。对于更高分辨率的系统，其中之一将是必要的。虽然这是一种非常幼稚的方法，但它似乎在 80x24 系统中运行良好。:D

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define Y 24
#define X 80

#define DIFFICULTY 1 // Try different values... 

static int point[Y][X];

double dist(int x1, int y1, int x2, int y2)
{
    return sqrt((y1 - y2)*(y1 - y2) + (x1 - x2)*(x1 - x2));
}

main()
{
    srand(time(0));
    int y = rand()%Y;
    int x = rand()%X;

    // Generate points
    for (int i = 0; i < Y; i++)
    {
        for (int j = 0; j < X; j++)
        {
            double u = DIFFICULTY * pow(dist(x, y, j, i), 1.0 / DIFFICULTY);
            if ((int)u == 0)
                u = 1;
            point[i][j] = !(rand()%(int)u);
        }
    }

    // Find best source
    int maxX = -1;
    int maxY = -1;
    double maxScore = -1;
    for (int cy = 0; cy < Y; cy++)
    {
        for (int cx = 0; cx < X; cx++)
        {
            double score = 0;
            for (int i = 0; i < Y; i++)
            {
                for (int j = 0; j < X; j++)
                {
                    if (point[i][j] == 1)
                    {
                        double d = dist(cx, cy, j, i);
                        if (d == 0)
                            d = 0.5;
                        score += 1000 / d;
                    }
                }
            }
            if (score > maxScore || maxScore == -1)
            {
                maxScore = score;
                maxX = cx;
                maxY = cy;
            }
        }
    }

    // Print out results
    for (int i = 0; i < Y; i++)
    {
        for (int j = 0; j < X; j++)
        {
            if (i == y && j == x)
                printf("X");
            else if (i == maxY && j == maxX)
                printf("G");            
            else if (point[i][j] == 0)
                printf(" ");
            else if (point[i][j] == 1)
                printf(".");
        }
    }
    printf("Distance from real source: %f", dist(maxX, maxY, x, y));

    scanf("%d", 0);

}

score 1 · Accepted Answer

您可以轻松地将 Tyler Durden 引用的 Rossmo 公式改编为您的案例，只需几个简单的注释：

公式：

这个公式给出了接近捕食者或连环杀手基本行动存在的概率。在你的情况下，它可以给出一个碱基在某个点的概率。我稍后会解释如何使用它。你可以这样写：

Proba（基于 A 点）= Sum{在所有点上} ( Phi/(dist^f)+(1-Phi)(B*(gf))/(2B-dist)^g )

使用欧几里得距离

您想要欧几里得距离而不是曼哈顿的距离，因为飞机或直升机不受道路/街道的约束。因此，如果您要跟踪飞机而不是连环杀手，则使用欧几里得距离是正确的方法。所以公式中的“dist”是你测试的点和考虑的点之间的欧几里得距离

取合理的变量 B

变量 B 用于表示规则“相当聪明的杀手不会杀死他的邻居”。在您的情况下，也将适用，因为没有人使用飞机/直升机到达下一个街角。我们可以假设最小的旅程是例如 10 公里或任何合理的适用于您的案例。

指数因子 f

因子 f 用于增加距离的权重。例如，如果所有地点都在一个小区域内，您可能需要一个很大的因子 f，因为如果您的所有数据点都在同一个扇区中，机场/基地/总部的概率会迅速降低。g 以类似的方式工作，允许选择“base is not just next to the spot”区域的大小

系数 :

同样，这个因素必须使用您对问题的了解来确定。它允许在“基地靠近点”和“我不会使用飞机制造 5 m”之间选择最准确的因素，例如，如果您认为第二个因素几乎无关紧要，您可以将 Phi 设置为 0.95(0<Phi<1)如果两者都有趣的是 phi 将在 0.5 左右

如何将其实现为有用的东西：

首先，您要将地图划分为小方块：对地图进行网格划分（就像 invisal 所做的那样）（方块越小，结果越准确（通常））然后使用公式找到更可能的位置。事实上，网格只是一个包含所有可能位置的数组。（如果你想要准确，你可以增加可能点的数量，但这需要更多的计算时间，而且 PhP 并不以惊人的速度而闻名）

算法：

//define all the factors you need(B , f , g , phi)

for(i=0..mesh_size) // computing the probability of presence for each square of the mesh
{
  P(i)=0;
  geocode squarePosition;//GeoCode of the square's center 
  for(j=0..geocodearray_size)//sum on all the known spots
  {
     dist=Distance(geocodearray[j],squarePosition);//small function returning distance between two geocodes

         P(i)+=(Phi/pow(dist,f))+(1-Phi)*pow(B,g-f)/pow(2B-dist,g);
  }
 }

return geocode corresponding to max(P(i))

希望对你有帮助

score 1 · Accepted Answer

首先，我想表达我对您在说明和解释问题方面的方法的喜爱..

如果我站在你的立场上，我会选择像DBSCAN这样的基于密度的算法 ，然后在对区域进行聚类并去除噪声点之后，将保留一些区域（选择）......然后我将采用密度最高的集群点并计算平均点并找到最接近它的实际点。完成，找到地方了！:)。

问候，

score 1 · Accepted Answer

一个简单的混合模型似乎可以很好地解决这个问题。

通常，要获得一个与数据集中所有其他点的距离最小的点，您可以取平均值。在这种情况下，您希望找到一个与集中点子集的距离最小的点。如果您假设一个点既可以来自集中的兴趣点集，也可以来自散布的背景点集，那么这就给出了一个混合模型。

我在下面包含了一些python代码。集中区域由高精度正态分布建模，背景点由低精度正态分布或数据集上边界框上的均匀分布建模（有一行代码可以注释掉在这些选项之间切换）。此外，混合模型可能有些不稳定，因此在随机初始条件下运行几次 EM 算法并选择具有最高对数似然的运行会得到更好的结果。

如果你真的在看飞机，那么添加某种时间相关的动力学可能会极大地提高你推断本垒的能力。

我也会对罗西莫的公式保持警惕，因为它包含了一些关于犯罪分布的非常强的假设。

#the dataset
sdata='''41.892694,-87.670898
42.056048,-88.000488
41.941744,-88.000488
42.072361,-88.209229
42.091933,-87.982635
42.149994,-88.133698
42.171371,-88.286133
42.23241,-88.305359
42.196811,-88.099365
42.189689,-88.188629
42.17646,-88.173523
42.180531,-88.209229
42.18168,-88.187943
42.185496,-88.166656
42.170485,-88.150864
42.150634,-88.140564
42.156743,-88.123741
42.118555,-88.105545
42.121356,-88.112755
42.115499,-88.102112
42.119319,-88.112411
42.118046,-88.110695
42.117791,-88.109322
42.182189,-88.182449
42.194145,-88.183823
42.189057,-88.196182
42.186513,-88.200645
42.180917,-88.197899
42.178881,-88.192062
41.881656,-87.6297
41.875521,-87.6297
41.87872,-87.636566
41.872073,-87.62661
41.868239,-87.634506
41.86875,-87.624893
41.883065,-87.62352
41.881021,-87.619743
41.879998,-87.620087
41.8915,-87.633476
41.875163,-87.620773
41.879125,-87.62558
41.862763,-87.608757
41.858672,-87.607899
41.865192,-87.615795
41.87005,-87.62043
42.073061,-87.973022
42.317241,-88.187256
42.272546,-88.088379
42.244086,-87.890625
42.044512,-88.28064
39.754977,-86.154785
39.754977,-89.648437
41.043369,-85.12207
43.050074,-89.406738
43.082179,-87.912598
42.7281,-84.572754
39.974226,-83.056641
38.888093,-77.01416
39.923692,-75.168457
40.794318,-73.959961
40.877439,-73.146973
40.611086,-73.740234
40.627764,-73.234863
41.784881,-71.367187
42.371988,-70.993652
35.224587,-80.793457
36.753465,-76.069336
39.263361,-76.530762
25.737127,-80.222168
26.644083,-81.958008
30.50223,-87.275391
29.436309,-98.525391
30.217839,-97.844238
29.742023,-95.361328
31.500409,-97.163086
32.691688,-96.877441
32.691688,-97.404785
35.095754,-106.655273
33.425138,-112.104492
32.873244,-117.114258
33.973545,-118.256836
33.681497,-117.905273
33.622982,-117.734985
33.741828,-118.092041
33.64585,-117.861328
33.700707,-118.015137
33.801189,-118.251343
33.513132,-117.740479
32.777244,-117.235107
32.707939,-117.158203
32.703317,-117.268066
32.610821,-117.075806
34.419726,-119.701538
37.750358,-122.431641
37.50673,-122.387695
37.174817,-121.904297
37.157307,-122.321777
37.271492,-122.033386
37.435238,-122.217407
37.687794,-122.415161
37.542025,-122.299805
37.609506,-122.398682
37.544203,-122.0224
37.422151,-122.13501
37.395971,-122.080078
45.485651,-122.739258
47.719463,-122.255859
47.303913,-122.607422
45.176713,-122.167969
39.566,-104.985352
39.124201,-94.614258
35.454518,-97.426758
38.473482,-90.175781
45.021612,-93.251953
42.417881,-83.056641
41.371141,-81.782227
33.791132,-84.331055
30.252543,-90.439453
37.421248,-122.174835
37.47794,-122.181702
37.510628,-122.254486
37.56943,-122.346497
37.593373,-122.384949
37.620571,-122.489319
36.984249,-122.03064
36.553017,-121.893311
36.654442,-121.772461
36.482381,-121.876831
36.15042,-121.651611
36.274518,-121.838379
37.817717,-119.569702
39.31657,-120.140991
38.933041,-119.992676
39.13785,-119.778442
39.108019,-120.239868
38.586082,-121.503296
38.723354,-121.289062
37.878444,-119.437866
37.782994,-119.470825
37.973771,-119.685059
39.001377,-120.17395
40.709076,-73.948975
40.846346,-73.861084
40.780452,-73.959961
40.778829,-73.958931
40.78372,-73.966012
40.783688,-73.965325
40.783692,-73.965615
40.783675,-73.965741
40.783835,-73.965873
'''

import StringIO
import numpy as np
import re

import matplotlib.pyplot as plt

def lp(l):
    return map(lambda m: float(m.group()),re.finditer('[^, \n]+',l))

data=np.array(map(lp,StringIO.StringIO(sdata)))

xmn=np.min(data[:,0])
xmx=np.max(data[:,0])
ymn=np.min(data[:,1])
ymx=np.max(data[:,1])

# area of the point set bounding box
area=(xmx-xmn)*(ymx-ymn)

M_ITER=100 #maximum number of iterations
THRESH=1e-10 # stopping threshold

def em(x):
    print '\nSTART EM'
    mlst=[]

    mu0=np.mean( data , 0 ) # the sample mean of the data - use this as the mean of the low-precision gaussian

    # the mean of the high-precision Gaussian - this is what we are looking for
    mu=np.random.rand( 2 )*np.array([xmx-xmn,ymx-ymn])+np.array([xmn,ymn])

    lam_lo=.001  # precision of the low-precision Gaussian
    lam_hi=.1 # precision of the high-precision Gaussian
    prz=np.random.rand( 1 ) # probability of choosing the high-precision Gaussian mixture component

    for i in xrange(M_ITER):
        mlst.append(mu[:])

        l_hi=np.log(prz)+np.log(lam_hi)-.5*lam_hi*np.sum((x-mu)**2,1)
        #low-precision normal background distribution
        l_lo=np.log(1.0-prz)+np.log(lam_lo)-.5*lam_lo*np.sum((x-mu0)**2,1)
        #uncomment for the uniform background distribution
        #l_lo=np.log(1.0-prz)-np.log(area)

        #expectation step
        zs=1.0/(1.0+np.exp(l_lo-l_hi))

        #compute bound on the likelihood 
        lh=np.sum(zs*l_hi+(1.0-zs)*l_lo)
        print i,lh

        #maximization step
        mu=np.sum(zs[:,None]*x,0)/np.sum(zs) #mean
        lam_hi=np.sum(zs)/np.sum(zs*.5*np.sum((x-mu)**2,1)) #precision
        prz=1.0/(1.0+np.sum(1.0-zs)/np.sum(zs)) #mixure component probability

        try:
            if np.abs((lh-old_lh)/lh)<THRESH:
                break
        except: 
            pass

        old_lh=lh

        mlst.append(mu[:])

    return lh,lam_hi,mlst    

if __name__=='__main__':

    #repeat the EM algorithm a number of times and get the run with the best log likelihood
    mx_prm=em(data)
    for i in xrange(4):
        prm=em(data)

        if prm[0]>mx_prm[0]:
            mx_prm=prm

        print prm[0]
        print mx_prm[0]

    lh,lam_hi,mlst=mx_prm
    mu=mlst[-1]

    print 'best loglikelihood:', lh
    #print 'final precision value:', lam_hi
    print 'point of interest:', mu
    plt.plot(data[:,0],data[:,1],'.b')

    for m in mlst:
        plt.plot(m[0],m[1],'xr')

    plt.show()

score 1 · Accepted Answer

虚拟地球很好地解释了如何相对快速地做到这一点。他们还提供了代码示例。请查看http://soulsolutions.com.au/Articles/ClusteringVirtualEarthPart1.aspx

score 0 · Accepted Answer

您可以采用最小生成树并删除最长的边。较小的树为您提供了要查找的中心点。算法名称是单链接k-clustering。这里有一个帖子：https ://stats.stackexchange.com/questions/1475/visualization-software-for-clustering 。

score 0 · Accepted Answer

为什么不这样：

对于每个点，计算它与所有其他点的距离并求和。
总和最小的点是你的中心。

也许总和不是最好的指标。可能是“小距离”最多的点？

score 0 · Accepted Answer

对距离求和。取总和距离最小的点。

function () {
    for i in points P:
        S[i] = 0
        for j in points P:
            S[i] += distance(P[i], P[j])
    return min(S);
}

algorithm - 如何找到一组数据点的中心？

14 回答 14

Related

Reference