sql - Select a random entry from a group after grouping by a value (not a column)?

Question

I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails part isn't important.

Table structure:

measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)

Example data:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...

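The table is a kind of manually created view for faster lookups (it has millions of rows). Otherwise we would have to join 8 tables and it would get even slower. But that's not part of the problem.

Simple version:

Parameters:

Point p
int d

What the query should do:

1. The query looks for all groundtruth Points that have a distance < d from Point p.

The SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d

2. Now we have a list of groundtruth Points with their anchor_ids. As you can see in the table above, there can be multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).

3. Next I'd like to eliminate the identical tuples by choosing one of them randomly(!). Why not simply take the first one? Because the data column differs.

Selecting a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1

My problem with all this: I can imagine a solution using SQL LOOPs and lots of subqueries, but there must be a solution using GROUP BY or some other method that makes it faster.

Full version:

Basically the same as above, with one difference: the input parameters change:

lots of Points p1 ... p312456345
still a single d

If the simple query works, this could be done with a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!

Solution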

-- p_array and d are the input parameters
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();

Thanks to Erwin Brandstetter!

score 1 · Accepted Answer

To eliminate duplicates, this is probably the most efficient query in PostgreSQL:

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d

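More on this query style:

Select first row in each GROUP BY group?

As mentioned in the comments, this gives you an arbitrary pick. If you need a random one, it is somewhat more expensive: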

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement
WHERE  st_distance(p, groundtruth) < d
ORDER  BY anchor_id, groundtruth, random()

The second part is harder to optimize. An EXISTS semi-join is probably the fastest choice. For a given table ps (p point):

SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
ORDER  BY anchor_id, groundtruth, random();

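This can stop evaluating as soon as one p is close enough, and it keeps the rest of the query simple.

Be sure to back this up with a matching GiST index.

If you have an array as input, create a CTE with unnest() on the fly: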
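To make the semantics of that query concrete, here is a minimal Python sketch of what it computes, not how Postgres executes it. The in-memory ROWS list and the function name pick_random_per_group are hypothetical stand-ins for the measurement table and the query; points are plain (x, y) tuples and distance is Euclidean. any() plays the role of the EXISTS semi-join, since it stops at the first input point that is close enough.

```python
import math
import random
from collections import defaultdict

# Hypothetical stand-in for the measurement table: (id, anchor_id, groundtruth).
ROWS = [
    (1, 1, (1, 4)), (2, 3, (1, 4)), (3, 2, (1, 4)), (4, 3, (1, 4)),
    (5, 2, (3, 2)), (6, 4, (3, 2)),
    (7, 1, (4, 3)), (8, 1, (4, 3)), (9, 1, (4, 3)), (10, 5, (4, 3)), (11, 3, (4, 3)),
]


def pick_random_per_group(rows, points, d, rng=random):
    """Return one random row per (anchor_id, groundtruth) near any input point."""
    groups = defaultdict(list)
    for row in rows:
        _id, anchor_id, groundtruth = row
        # WHERE EXISTS (...): any() short-circuits on the first close point.
        if any(math.dist(groundtruth, p) < d for p in points):
            groups[(anchor_id, groundtruth)].append(row)
    # DISTINCT ON (anchor_id, groundtruth) ... ORDER BY anchor_id, groundtruth, random()
    return [rng.choice(candidates) for _key, candidates in sorted(groups.items())]


picked = pick_random_per_group(ROWS, points=[(1, 4)], d=1.0)
for row in picked:
    print(row)
```

With the sample data from the question, anchor_id 3 at POINT(1 4) has two candidate rows (ids 2 and 4), and each run keeps one of them at random. The sketch scans every row; in the database, the index is what avoids that.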

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...

Update after comments

If you only need a single row as the answer, you can simplify:

WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM   measurement m
WHERE  EXISTS (
   SELECT 1
   FROM   ps
   WHERE  st_distance(ps.p, m.groundtruth) < d
   )
LIMIT  1;

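Faster with `ST_DWithin()`

The function ST_DWithin() is probably more efficient (together with a matching GiST index!).
To get a single row (using a subselect instead of a CTE here):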

SELECT *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT  1;

To get one row for every point p within distance d:

SELECT DISTINCT ON (ps.p) *
FROM   measurement m
JOIN  (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)

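Adding ORDER BY random() would make this query more expensive. Without random(), Postgres can just pick the first matching row from the GiST index. Otherwise, all possible matches have to be retrieved and ordered randomly.

As an aside, LIMIT 1 inside EXISTS is pointless. Read the manual at the link I provided, or this related question.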

score 0 · Accepted Answer

I've hacked it together now, but the query is slow...

WITH
  ps AS (
    SELECT unnest(p_array) AS p
  ),

  gtps AS (
    SELECT DISTINCT ON(ps.p)
      ps.p, m.groundtruth
    FROM measurement m, ps
    WHERE st_distance(m.groundtruth, ps.p) < d
    ORDER BY ps.p, RANDOM()
  )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

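My test database contains 22000 rows. I gave it two input values and it takes about 700 ms. In the end there could be hundreds of input values :-/

The result now looks like this: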

id  | anchor_id | groundtruth | p
-----------------------------------------
20  | 1         | POINT(0 2)  | POINT(1 0)
14  | 3         | POINT(0 2)  | POINT(1 0)
5   | 8         | POINT(0 2)  | POINT(1 0)
42  | 2         | POINT(4 1)  | POINT(2 2)
11  | 3         | POINT(4 8)  | POINT(4 8)
4   | 6         | POINT(4 8)  | POINT(4 8)
1   | 1         | POINT(6 2)  | POINT(7 3)
9   | 5         | POINT(6 2)  | POINT(7 3)
25  | 3         | POINT(6 2)  | POINT(9 1)
13  | 6         | POINT(6 2)  | POINT(9 1)
18  | 7         | POINT(6 2)  | POINT(9 1)

New:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p

Actual result:

p           | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1          | G1          | {1,3,2,..} | {9,8,11,..}
P1          | G2          | {4,3,5,..} | {1,8,23,..}
P1          | G3          | {6,8,9,..} | {12,7,6,..}
P2          | G1          | {6,6,2,..} | {15,2,10,..}
P2          | G4          | {7,9,1,..} | {5,4,3,..}
...         | ...         | ...        | ...

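So at the moment I get:

every distinct inputValue-groundtruth tuple
for every tuple, an array with all anchor_ids corresponding to the groundtruth part of the tuple
and an array of all ids corresponding to the groundtruth-anchor_id relation

Remember:

two input values can "select" the same groundtruth
a single groundtruth value can have multiple identical anchor_ids
every groundtruth-anchor_id tuple has a distinct id

So what is still missing?

I need just one random row per ps.p
The two arrays belong together. Meaning: the order of the items inside matters!
The two arrays need to be filtered (randomly):
- for every anchor_id that appears more than once in the array: keep a random one and delete all the others. This also means removing the corresponding id from the id array for every deleted anchor_id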
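
The array filtering described above can be sketched in Python as follows. This only illustrates the intended result on the client side; the function name dedupe_pairs is made up for this example, and the same effect might be easier to get in the database by eliminating duplicates before aggregating. The two lists are treated as parallel, so position i in one belongs to position i in the other.

```python
import random
from collections import defaultdict


def dedupe_pairs(anchor_ids, ids, rng=random):
    """Keep one random id per distinct anchor_id, keeping the two lists aligned."""
    if len(anchor_ids) != len(ids):
        raise ValueError("both arrays must have the same length")
    by_anchor = defaultdict(list)
    for anchor_id, row_id in zip(anchor_ids, ids):
        by_anchor[anchor_id].append(row_id)
    # Dicts preserve insertion order, so anchors stay in first-seen order.
    kept = [(anchor_id, rng.choice(candidates))
            for anchor_id, candidates in by_anchor.items()]
    return [a for a, _ in kept], [i for _, i in kept]


# Hypothetical input: anchor_id 6 occurs twice, so one of ids 15 and 2 is dropped.
anchors, row_ids = dedupe_pairs([6, 6, 2], [15, 2, 10])
print(anchors, row_ids)
```

Grouping the ids by anchor_id first and choosing once per group keeps the pairing intact, because an anchor_id and its surviving id are always emitted together.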
