azure-data-explorer - Finding the shortest distance between two sets of points on a map (large datasets)

Question

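I have two sets of points in two separate tables, like this: t1: Point_1 | Lat | Long ... Point_n | Lat | Long and t2: Pt_1 | Lat | Long ... Pt_m | Lat | Long, with no relationship between the two tables. What is the best way (using the fewest resources) to identify the 3 closest points in t2 for each point in t1, especially when t1 and t2 are huge? Maybe geohashing? What I tried, and what seems to work fine for small datasets, is: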

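t1
| extend blah = 1
// Cross join: every row of t1 is paired with every row of t2 through the constant key.
| join kind=fullouter (t2 | extend blah = 1) on blah
// Long1 and Lat1 are the t2 columns, renamed automatically by the join.
| extend distance = geo_distance_2points(Long, Lat, Long1, Lat1)
| sort by point asc, distance asc
// Restart the rank at 1 whenever the point changes.
| extend rnk = row_number(1, point != prev(point))
| where rnk <= 3
| project point, pt, distance, rnk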

Please excuse the sloppiness; I'm still learning. Thanks!

score 2 · Accepted Answer

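Try to reduce the data size on both sides of the join operator by filtering out irrelevant or malformed rows and columns. Perhaps you can use geo_point_in_polygon() or geo_point_in_circle() to discard irrelevant data.
Try using a broadcast join or a shuffle join: https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/broadcastjoin https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/shufflequery
You can use the s2/geohash/h3 hashing functions in two ways: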

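a. For each table, combine nearby points into one representative point. The idea is to use the central point of a hash cell as the representative of all points that fall in that cell. This reduces the table size. Something like: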

          datatable(lng:real, lat:real)
          [
             10.1234, 53,
             10.3579, 53,
             10.6842, 53,
          ]
          | summarize by hash = geo_point_to_s2cell(lng, lat, 8)
          | project geo_s2cell_to_central_point(hash)

b. Compute a hash value for each point and join on the hash value. Something like:

        let t1 =
            datatable(lng:real, lat:real)
            [
              10.3579, 53,
              10.6842, 53,
            ];
        let t2 =
            datatable(lng:real, lat:real)
            [
              10.1234, 53,
            ];
        t1 | extend hash = geo_point_to_s2cell(lng, lat, 8)
        | join kind=fullouter  hint.strategy=broadcast (t2 | extend hash = geo_point_to_s2cell(lng, lat, 8)) on hash
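
To get from the hash join in (b) to the 3 nearest points, one option is to compute the distance for each joined pair and rank the pairs within each t1 point, reusing the row_number() pattern from the question. This is a minimal sketch, not part of the original answer: the id columns p1 and p2, the renamed coordinate columns, and the extra sample point "y" are assumptions added for illustration. It only pairs points that fall in the same S2 cell.

```kusto
// Hypothetical sample data; p1 and p2 are assumed id columns.
let t1 = datatable(p1:string, lng1:real, lat1:real)
[
    "a", 10.3579, 53.0,
    "b", 10.6842, 53.0
];
let t2 = datatable(p2:string, lng2:real, lat2:real)
[
    "x", 10.1234, 53.0,
    "y", 10.4000, 53.0
];
t1
| extend hash = geo_point_to_s2cell(lng1, lat1, 8)
| join kind=inner (
    t2
    | extend hash = geo_point_to_s2cell(lng2, lat2, 8)
) on hash
// Distance in meters for every pair that shares a cell.
| extend distance = geo_distance_2points(lng1, lat1, lng2, lat2)
| sort by p1 asc, distance asc
// Rank the candidates per t1 point and keep the 3 nearest.
| extend rnk = row_number(1, p1 != prev(p1))
| where rnk <= 3
| project p1, p2, distance, rnk
```

A coarser S2 level (a smaller number) makes the cells larger, so more candidate pairs survive the join at the cost of more distance computations.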

Perhaps the partition operator can also speed up the query: https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/partitionoperator
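
The first two suggestions (pre-filtering and a join hint) can be combined with the hash join. The sketch below is only an illustration under assumptions: the sample rows, the id columns p1 and p2, the center point, and the 100 km radius are all hypothetical, and whether the shuffle hint helps depends on the cluster and the data.

```kusto
// Hypothetical sample data and region of interest, for illustration only.
let t1 = datatable(p1:string, lng1:real, lat1:real)
[
    "a", 10.3579, 53.0,
    "b", 10.6842, 53.0,
    "far", -120.5678, 45.1234
];
let t2 = datatable(p2:string, lng2:real, lat2:real)
[
    "x", 10.1234, 53.0,
    "y", 10.4000, 53.0
];
let center_lng = 10.4;
let center_lat = 53.0;
let radius_m = 100000;   // assumed cutoff of 100 km
t1
// Discard rows outside the region before the join; this removes "far".
| where geo_point_in_circle(lng1, lat1, center_lng, center_lat, radius_m)
| extend hash = geo_point_to_s2cell(lng1, lat1, 8)
| join kind=inner hint.strategy=shuffle (
    t2
    | where geo_point_in_circle(lng2, lat2, center_lng, center_lat, radius_m)
    | extend hash = geo_point_to_s2cell(lng2, lat2, 8)
) on hash
| extend distance = geo_distance_2points(lng1, lat1, lng2, lat2)
| project p1, p2, distance
```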

score 0 · Accepted Answer

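I found what I think is a better way to do this and wanted to share it.

First, the problem with tessellation/geohashing is this: suppose you have two sets of coordinate points in two tables, T1 and T2, and for each point in T1 you want to find the closest point in T2. Now suppose a point in T1 lies very close to the boundary of a geohash cell, and a point in T2 lies close to the same boundary but in the neighboring cell. With a join on the hash ID, the algorithm never computes the distance between these two points even though they are close, so the final result misses this pair.
A better way to join the two tables for computing point-to-point distances is to generate a join key from truncated coordinates. For every point in each table, we create this key according to the distance that matters (the maximum point-to-point distance we care about).
Example: for a point with coordinates (45.1234; -120.5678), the join key can be 45.1-120.6 (truncate and concatenate). With this rounding and a join on the key, we capture everything in table 2 within a radius of roughly 15 km of this point in table 1. Using 45-120 as the join key would capture everything within roughly 150 km. This significantly shrinks the joined table and avoids the caveat of the geohash approach.

At this point I'm better at writing prose than code :), but I hope what I described above makes sense. It does work for my project while sidestepping the resource problems (CPU/memory).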
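
The join key described above could be sketched in KQL as follows. This is one possible reading, not the author's actual code: the table contents and the column names (p1, p2, lat_key, lng_key) are assumptions, and the key is kept as two integer columns (451 and -1206 for the example point) instead of the concatenated string 45.1-120.6, which avoids depending on how real numbers are formatted as text. Two points that round to different keys are never paired, however close they are.

```kusto
// Hypothetical sample data; column names and values are assumptions.
let t1 = datatable(p1:string, lat1:real, lng1:real)
[
    "a", 45.1234, -120.5678
];
let t2 = datatable(p2:string, lat2:real, lng2:real)
[
    "x", 45.1400, -120.5900,
    "y", 46.9000, -121.3000
];
t1
// Round to one decimal place and store the result as integers (451, -1206).
| extend lat_key = tolong(round(lat1 * 10)), lng_key = tolong(round(lng1 * 10))
| join kind=inner (
    t2
    | extend lat_key = tolong(round(lat2 * 10)), lng_key = tolong(round(lng2 * 10))
) on lat_key, lng_key
// Only pairs with identical keys reach this point: "a" pairs with "x", not with "y".
| extend distance = geo_distance_2points(lng1, lat1, lng2, lat2)
| project p1, p2, distance
| sort by p1 asc, distance asc
```

Multiplying by 1 instead of 10 gives the coarser whole-degree key from the example (45 and -120 become 45 and -121 with rounding, so truncating toward zero would need a different expression).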

score 0 · Accepted Answer

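Glad you found an approach that works for you. Another option you can try is to take the neighboring cells into account as well.

H3 hashing has this capability: https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/geo-h3cell-rings-function

Something like this: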

let h3_resolution = 8;
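// The values below are listed as latitude, longitude (New York City),
// so the columns are declared in that order.
let t1 = datatable(lat1:real, lng1:real)
[
    40.75864778392896, -73.97856558479198,
    40.74860253711237, -73.98577679198793,
    40.741092676839024, -73.9902397446769,
];
let t2 = datatable(lat2:real, lng2:real)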
[
    40.75594965648444, -73.98157034840024,
    40.766085141039774, -74.01798702196743
];
t1 
| extend hash = geo_point_to_h3cell(lng1, lat1, h3_resolution)
| join kind = inner (
    t2 
    | extend rings = geo_h3cell_rings(geo_point_to_h3cell(lng2, lat2, h3_resolution),1)
    | project lng2, lat2, hash_array = array_concat(rings[0], rings[1])
    | mv-expand hash_array to typeof(string)
) on $left.hash == $right.hash_array
| project-away hash, hash_array
| extend distance = geo_distance_2points(lng1, lat1, lng2, lat2)
| project p1 = tostring(pack_array(lng1, lat1)), p2 = pack_array(lng2, lat2), distance
| sort by distance asc 
| summarize closest_3_points = make_list(p2, 3) by p1
