python - What is the fastest way to generate a distance matrix between latitude and longitude locations?

Question

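Thanks for reading. I currently have the latitude and longitude of many locations, and I need to create a distance matrix for locations within 10 km of each other. (It is fine to fill the matrix with 0 for pairs of locations that are more than 10 km apart.)

The data looks like this:

place_coordinates = [[lat1, lon1], [lat2, lon2], ...]

At the moment I compute it with the code below, but it takes a very long time.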

place_correlation = pd.DataFrame(
   squareform(pdist(place_coordinates, metric=haversine)),
   index=place_coordinates,
   columns=place_coordinates
)

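When using squareform, I don't know how to avoid computing or storing a distance when the pair is more than 10 km apart.

What is the fastest way to do this?

Thanks in advance!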

score 1 · Accepted Answer

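First, do you need to use the haversine metric to compute the distances? Which implementation are you using? If you used, e.g., the euclidean metric, the computation would be faster, but I guess you have good reasons for choosing this one.

In that case, it would be better to use a more optimized implementation of haversine (but I don't know which implementation you use). Check e.g. this SO question.

I guess you are using pdist and squareform from scipy.spatial.distance. When you look at the implementation behind them (here), you will find that a Python for loop is used when the metric is a custom function. In that case, you would rather use a vectorized implementation (e.g. this one from the question linked above).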
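For readers who cannot follow the link, here is a minimal sketch of what such a vectorized haversine could look like. It is an illustration written for this answer, not a copy of the linked implementation: the name `haversine_np` and the `(lon1, lat1, lon2, lat2)` argument order follow the code below, while the `EARTH_RADIUS_KM` constant is an assumption (the linked version may use a slightly different radius).

```python
import numpy as np

# Assumed mean Earth radius in km; other implementations may use a different value.
EARTH_RADIUS_KM = 6371.0088


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees.

    All arguments may be scalars or NumPy arrays of the same shape;
    the computation is done element-wise without a Python loop.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because every step is a NumPy array operation, one call handles all pairs at once, which is where the speed-up over a per-pair Python function comes from.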

import numpy as np
import itertools
from scipy.spatial.distance import pdist, squareform
from haversine import haversine  # pip install haversine

# original approach; the haversine package treats each point as (lat, lon)
place_coordinates = [(x, y) for x in range(10) for y in range(10)]
d = pdist(place_coordinates, metric=haversine)

# approach using combinations
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
d2 = [haversine(x, y) for (x, y) in place_coordinates_comb]

# just ensure that using combinations gives you the same results as using pdist
np.testing.assert_array_equal(d, d2)

# vectorized version (haversine_np is taken from the link above and must be defined first)
# 1) create combinations; each pair is flattened to (lat1, lon1, lat2, lon2)
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
place_coordinates_comb_flatten = [(*x, *y) for (x, y) in place_coordinates_comb]
# 2) split into columns; the points are (lat, lon), so latitude comes first
lat1, lon1, lat2, lon2 = np.array(place_coordinates_comb_flatten).T
# 3) vectorized computation; note that haversine_np takes (lon1, lat1, lon2, lat2)
d_vect = haversine_np(lon1, lat1, lon2, lat2)

# the result differs slightly from the haversine package, which is ok imo, so compare
# with a tolerance instead of exactly; the vectorized implementation can of course be
# improved to return exactly the same results
np.testing.assert_allclose(d, d_vect, rtol=1e-2)

When you compare the timings (absolute numbers will differ depending on the machine used):

%timeit pdist(place_coordinates, metric=haversine)
# 15.7 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit haversine_np(lon1, lat1, lon2, lat2)
# 241 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

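That is a big difference (roughly 65 times faster). With very long arrays (how many coordinates do you have?), this will help you a lot.

Finally, you can combine it with your code: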

place_correlation = pd.DataFrame(squareform(d_vect), index=place_coordinates, columns=place_coordinates)
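The question also asks for 0 wherever two places are more than 10 km apart. One way to get that, sketched here under the assumption that the distances are already in the condensed form returned by pdist (or `d_vect` above), is to zero out the large entries before calling squareform. The helper name `threshold_distance_matrix` and its `max_km` parameter are made up for this example.

```python
import numpy as np
from scipy.spatial.distance import squareform


def threshold_distance_matrix(condensed_km, max_km=10.0):
    """Turn a condensed distance vector (km) into a square matrix,
    replacing every distance above max_km with 0."""
    condensed = np.asarray(condensed_km, dtype=float)
    # Keep distances up to max_km, write 0 for everything farther away.
    kept = np.where(condensed <= max_km, condensed, 0.0)
    return squareform(kept)


# Three places: pairs (0,1)=5 km, (0,2)=20 km, (1,2)=8 km
matrix = threshold_distance_matrix([5.0, 20.0, 8.0])
```

The resulting array can be passed to pd.DataFrame in the same way as squareform(d_vect). Note that this only changes what is stored; all distances are still computed first.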

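An additional improvement might be to use another metric (e.g. euclidean, which will be faster) to quickly determine which pairs are more than 10 km apart, and then compute haversine only for the remaining pairs.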
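A rough sketch of this pre-filtering idea follows. Instead of the euclidean metric it uses a simpler bound as the cheap first pass: the great-circle distance between two points is never smaller than the distance implied by their latitude difference alone, so pairs failing that check can be dropped without ever being wrongly discarded. The function name `distance_matrix_within`, the `max_km` parameter, and the Earth radius constant are assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius
KM_PER_DEG_LAT = np.pi * EARTH_RADIUS_KM / 180.0  # km covered by one degree of latitude


def distance_matrix_within(coords, max_km=10.0):
    """coords: sequence of (lat, lon) in degrees.

    Returns an n x n matrix of haversine distances in km,
    with 0 for pairs farther apart than max_km.
    """
    pts = np.asarray(coords, dtype=float)
    n = len(pts)
    i, j = np.triu_indices(n, k=1)  # every unordered pair exactly once

    # Cheap pre-filter: if the latitude difference alone exceeds max_km,
    # the true distance must exceed it too, so skip the trigonometry.
    near = np.abs(pts[i, 0] - pts[j, 0]) * KM_PER_DEG_LAT <= max_km
    i, j = i[near], j[near]

    # Haversine only for the surviving candidate pairs.
    lat1, lon1 = np.radians(pts[i, 0]), np.radians(pts[i, 1])
    lat2, lon2 = np.radians(pts[j, 0]), np.radians(pts[j, 1])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    d = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    # Fill a symmetric matrix, leaving 0 for pairs beyond max_km.
    keep = d <= max_km
    out = np.zeros((n, n))
    out[i[keep], j[keep]] = d[keep]
    out[j[keep], i[keep]] = d[keep]
    return out


m = distance_matrix_within([(0.0, 0.0), (0.0, 0.05), (1.0, 0.0)])
```

This still enumerates all pairs, so memory grows quadratically with the number of places; the filter only saves the trigonometric work. For a very large number of places, a spatial index that returns neighbours within a radius would probably be the better tool.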
