2

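I have a large .csv dataset containing 10e7 points, with coordinates (latitude, longitude) representing visitor locations. I have another dataset containing 10e3 points, whose coordinates represent store locations.

I want to associate each visitor with the nearest store using some kind of geodesic formula.

I want something really fast and efficient that I can run in Python (e.g. pandas) or on Google BigQuery.

Can someone give me a hint?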

4

2 Answers

4

To add to Felipe's answer:

You can use a SQL UDF instead of a JS UDF. JS UDFs have some limits that SQL UDFs do not.

So the equivalent SQL UDF, which you can use with the rest of Felipe's code, is:

CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64 AS ((
WITH constants AS (
  SELECT 0.017453292519943295 AS p
) 
SELECT 12742 * ASIN(SQRT(
  0.5 - COS((lat2 - lat1) * p)/2 + 
  COS(lat1 * p) * COS(lat2 * p) * 
  (1 - COS((lon2 - lon1) * p))/2))
FROM constants
));

I tried to preserve the layout of the corresponding JS UDF as much as possible, so you can see how it was derived.
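For anyone who wants to sanity-check the UDF outside BigQuery, here is a minimal Python port of the same haversine expression. It is only a sketch: the function name `distance` mirrors the UDF, and the clamp on `a` is an addition that is not in either UDF, included to guard against floating-point rounding.

```python
import math

def distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, using the same expression as the UDFs."""
    p = 0.017453292519943295  # math.pi / 180, degrees to radians
    a = (0.5 - math.cos((lat2 - lat1) * p) / 2
         + math.cos(lat1 * p) * math.cos(lat2 * p)
         * (1 - math.cos((lon2 - lon1) * p)) / 2)
    # Clamp to [0, 1]; not part of the original UDFs (assumption for safety).
    a = min(1.0, max(0.0, a))
    return 12742 * math.asin(math.sqrt(a))  # 2 * R; R = 6371 km

if __name__ == "__main__":
    # One degree of longitude along the equator.
    print(distance(0.0, 0.0, 0.0, 1.0))
```

Comparing a few values from this function against the UDF output is a quick way to confirm the port before running it on the full table.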

answered 2016-11-16T20:18:40.007
3

Here is a quick solution that finds the nearest NOAA weather station for each of 21,221 cities in DBpedia (v2014).

#standardSQL

CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """

  var p = 0.017453292519943295;    // Math.PI / 180
  var c = Math.cos;
  var a = 0.5 - c((lat2 - lat1) * p)/2 + 
          c(lat1 * p) * c(lat2 * p) * 
          (1 - c((lon2 - lon1) * p))/2;

  return 12742 * Math.asin(Math.sqrt(a)); // 2 * R; R = 6371 km

""";

SELECT *
FROM (
  SELECT city, country_label, distance, name weather_station, country, 
    RANK() OVER(PARTITION BY city ORDER BY distance) rank
  FROM (
    SELECT city, a.country_label, distance(a.lat,a.lon,b.lat,b.lon) distance, b.name, b.country
    FROM (
      SELECT rdf_schema_label city, country_label, country,
        CAST(REGEXP_EXTRACT(point, r'(-?\d*\.\d*)') as FLOAT64) lat, 
        CAST(REGEXP_EXTRACT(point, r' (-?\d*\.\d*)') as FLOAT64) lon 
      FROM `fh-bigquery.dbpedia2014temp.City`
      WHERE point!='NULL'
    ) a
    JOIN (
      SELECT name, country, usaf, wban, lat, lon
      FROM `bigquery-public-data.noaa_gsod.stations`
      WHERE lat != 0.0 AND lon !=0.0
    ) b
    ON CAST(a.lat as INT64)=CAST(b.lat as INT64)
    AND CAST(a.lon as INT64)=CAST(b.lon as INT64)
  )
)
WHERE rank=1
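The query above avoids a full cross join by only comparing rows that fall in the same one-degree latitude/longitude cell. The same idea can be sketched in plain Python for the visitor/store case. This is an illustration under stated assumptions, not a tuned implementation: `cell` and `nearest_in_cell` are hypothetical names, the inputs are assumed to be lists of `(name, lat, lon)` tuples, and `math.floor` is used to form the cells, so the exact cell boundaries may differ from the `CAST(... AS INT64)` in the query.

```python
import math
from collections import defaultdict

def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km, same expression as the UDF."""
    p = math.pi / 180
    a = (0.5 - math.cos((lat2 - lat1) * p) / 2
         + math.cos(lat1 * p) * math.cos(lat2 * p)
         * (1 - math.cos((lon2 - lon1) * p)) / 2)
    a = min(1.0, max(0.0, a))  # guard against rounding (not in the UDF)
    return 12742 * math.asin(math.sqrt(a))

def cell(lat, lon):
    """One-degree grid cell containing the point (assumed cell scheme)."""
    return (math.floor(lat), math.floor(lon))

def nearest_in_cell(visitors, stores):
    """Map each visitor name to (store name, km) for the closest store
    in the same cell. Visitors with no store in their cell are skipped,
    as they would be by the inner JOIN in the query."""
    buckets = defaultdict(list)
    for name, lat, lon in stores:
        buckets[cell(lat, lon)].append((name, lat, lon))

    result = {}
    for vname, vlat, vlon in visitors:
        candidates = buckets.get(cell(vlat, vlon), [])
        if not candidates:
            continue
        best = min(candidates, key=lambda s: distance(vlat, vlon, s[1], s[2]))
        result[vname] = (best[0], distance(vlat, vlon, best[1], best[2]))
    return result

if __name__ == "__main__":
    stores = [("A", 48.1, 11.1), ("B", 48.9, 11.9)]
    visitors = [("v1", 48.2, 11.2), ("v2", 48.8, 11.8)]
    print(nearest_in_cell(visitors, stores))
```

The trade-off is the same as in the query: a store just across a cell boundary is never considered, so the result is the nearest store within the cell rather than a guaranteed global nearest. Searching the eight neighbouring cells as well would reduce that error at the cost of more comparisons.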

Caveats:


answered 2016-11-16T13:19:06.237