Related, preceding question:
Select a random entry from a group after grouping by a value (not column)?
My current query looks like this:
WITH
points AS (
    SELECT unnest(array_of_points) AS p
),
gtps AS (
    SELECT DISTINCT ON (points.p)
        points.p, m.groundtruth
    FROM measurement m, points
    WHERE st_distance(m.groundtruth, points.p) < distance
    ORDER BY points.p, RANDOM()
)
SELECT DISTINCT ON (gtps.p, gtps.groundtruth, m.anchor_id)
    m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
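Semantics:
There are two input values:
- Line 4: an array of points, array_of_points
- Line 12: a double precision number, distance
First paragraph (lines 1-6):
- Create a table from the array of points for use in ...
Second paragraph (lines 8-14):
- For every point in the points table: get one random (!) groundtruth point from the measurement table whose distance to that point is < distance
- Save those tuples in the gtps table
Third paragraph (lines 16-19):
- For every groundtruth value in the gtps table: get all anchor_id values and ...
- If an anchor_id value is not unique: pick one at random
Output: id, anchor_id, groundtruth, p (the input value from array_of_points)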
Example table:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 8 | POINT(1 4) | ...
4 | 6 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 6 | POINT(4 3) | ...
10 | 7 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
-----------------------------------
12 | 1 | POINT(6 2) | ...
13 | 5 | POINT(6 2) | ...
Example result:
id | anchor_id | groundtruth | p
-----------------------------------------
1 | 1 | POINT(1 4) | POINT(1 0)
2 | 3 | POINT(1 4) | POINT(1 0)
4 | 6 | POINT(1 4) | POINT(1 0)
3 | 8 | POINT(1 4) | POINT(1 0)
5 | 2 | POINT(3 2) | POINT(2 2)
6 | 4 | POINT(3 2) | POINT(2 2)
1 | 1 | POINT(1 4) | POINT(4 8)
2 | 3 | POINT(1 4) | POINT(4 8)
4 | 6 | POINT(1 4) | POINT(4 8)
3 | 8 | POINT(1 4) | POINT(4 8)
12 | 1 | POINT(6 2) | POINT(7 3)
13 | 5 | POINT(6 2) | POINT(7 3)
1 | 1 | POINT(4 3) | POINT(9 1)
11 | 3 | POINT(4 3) | POINT(9 1)
9 | 6 | POINT(4 3) | POINT(9 1)
10 | 7 | POINT(4 3) | POINT(9 1)
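As you can see:
- Each input value can have multiple equal groundtruth values.
- If an input value has multiple groundtruth values, they must all be equal.
- Every groundtruth-inputPoint tuple is connected with every possible anchor_id for that groundtruth.
- Two different input values can have the same corresponding groundtruth value.
- Two distinct groundtruth-inputPoint tuples can have the same anchor_id.
- Two identical groundtruth-inputPoint tuples must have different anchor_ids.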
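The two hard constraints in the list above (one groundtruth per input point, no repeated anchor_id within one groundtruth-inputPoint tuple) can be written down as a small check over result rows. This is only a sketch in Ruby, assuming each row arrives as a Hash with the keys :id, :anchor_id, :groundtruth and :p; the helper name valid_result? is made up for illustration.

```ruby
# Hypothetical helper: checks the constraints listed above on a result set.
# Each row is assumed to be a Hash with :id, :anchor_id, :groundtruth and :p.
def valid_result?(rows)
  # Every input point p must map to exactly one groundtruth value.
  one_groundtruth_per_p = rows.group_by { |r| r[:p] }.all? do |_p, group|
    group.map { |r| r[:groundtruth] }.uniq.size == 1
  end

  # Within one (p, groundtruth) tuple every anchor_id may appear only once.
  keys = rows.map { |r| [r[:p], r[:groundtruth], r[:anchor_id]] }
  unique_anchor_per_tuple = keys.uniq.size == keys.size

  one_groundtruth_per_p && unique_anchor_per_tuple
end
```

Two different input points sharing a groundtruth, or two different tuples sharing an anchor_id, both pass this check, as the list allows.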
Benchmarks (for two input values):
- Lines 1-6: 16 ms
- Lines 8-14: 48 ms
- Lines 16-19: 600 ms
EXPLAIN VERBOSE:
Unique (cost=11119.32..11348.33 rows=18 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0)
Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160)
Output: points.p, m.groundtruth, (random())
-> Sort (cost=7659.95..7679.04 rows=7634 width=160)
Output: points.p, m.groundtruth, (random())
Sort Key: points.p, (random())
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160)
Output: points.p, m.groundtruth, random()
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32)
Output: points.p
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=132)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
-> Sort (cost=3421.18..3478.43 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72)
Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64)
Output: gtps.p, gtps.groundtruth
-> Seq Scan on public.measurement m (cost=0.00..535.01 rows=22901 width=8)
Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
EXPLAIN ANALYZE:
Unique (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
CTE points
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
CTE gtps
-> Unique (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
-> Sort (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
Sort Key: points.p, (random())
Sort Method: external merge Disk: 1248kB
-> Nested Loop (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
-> CTE Scan on points (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
-> Sort (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
Sort Method: external merge Disk: 4008kB
-> Nested Loop (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
-> CTE Scan on gtps (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
-> Seq Scan on measurement m (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms
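In live operation this has to run for roughly 100-1000 input values. Right now that would take 35 to 350 seconds, which is far too slow.
I have already tried removing the RANDOM() calls. That reduces the runtime (for 2 input values) from about 670 ms to about 530 ms, so they are not the main cost at the moment.
If it is easier/faster, it would also be possible to run 2 or 3 separate queries and do some parts in software (this runs on a Ruby on Rails server). For example the random selection?!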
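To make the "random selection in software" idea concrete: if the database only returned all candidate (p, groundtruth) pairs within range, the random pick per input point could be done in plain Ruby. This is a rough sketch, not a measured alternative. The helper name pick_random_groundtruth and the shape of the input (an Array of [p, groundtruth] pairs) are assumptions.

```ruby
# Hypothetical helper: for every input point p, keep one randomly chosen
# groundtruth from the candidate pairs returned by the database.
# candidates is assumed to be an Array of [p, groundtruth] pairs.
def pick_random_groundtruth(candidates, random: Random.new)
  candidates
    .group_by { |p, _groundtruth| p }                       # one bucket per input point
    .transform_values { |pairs| pairs.sample(random: random).last }
end
```

The result is a Hash from p to the chosen groundtruth. A groundtruth that occurs in many candidate pairs is more likely to be picked. Calling uniq on the candidates first would give every distinct groundtruth the same chance.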
Work in progress:
SELECT
    m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
    measurement m
JOIN
    (SELECT unnest(point_array) AS p) AS ps
    ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p
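This query is very fast (15 ms), but a lot is still missing:
- I need only one random row per ps.p
- The two arrays belong together, meaning the order of the items inside them matters!
- The two arrays need to be filtered (randomly):
  For every anchor_id that appears more than once in the array: keep one random occurrence and delete all the others. This also means removing the corresponding id from the id array for every deleted anchor_id.
It would also be nice if anchor_id and id could be stored in an array of tuples, e.g. {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== the second value in the example) is unique, anchor_ids are not unique). This example shows the query result without the filter that still has to be applied. With the filter applied, it would look like this: {[6,3],[4,4],[8,5]}.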
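For illustration, this is how the filter on the tuple array would look if it were done on the Rails side rather than in SQL. It is a sketch only: the helper name one_pair_per_anchor is made up, and the pairs are assumed to arrive as [anchor_id, id] arrays.

```ruby
# Hypothetical helper: keep one random [anchor_id, id] pair per anchor_id.
def one_pair_per_anchor(pairs, random: Random.new)
  pairs
    .group_by { |anchor_id, _id| anchor_id }      # bucket the pairs by anchor_id
    .values
    .map { |group| group.sample(random: random) } # one random survivor per bucket
end

pairs = [[4, 1], [6, 3], [4, 2], [8, 5], [4, 4]]
filtered = one_pair_per_anchor(pairs)
# filtered holds three pairs, one each for anchor_id 4, 6 and 8,
# for example [[4, 4], [6, 3], [8, 5]]
```

Because each surviving element is still a complete pair, the id always stays attached to its anchor_id.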
Work in progress II:
SELECT DISTINCT ON (ps.p)
    m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
    measurement m
JOIN
    (SELECT unnest(point_array) AS p) AS ps
    ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()
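This now gives a very good result and is still very fast: 16 ms.
There is just one thing left to do:
ARRAY_AGG(m.anchor_id) is already randomized, but:
- it contains a lot of duplicate entries, so:
- I would like to use something like DISTINCT on it, but:
- it has to stay in sync with ARRAY_AGG(m.id). That means:
  if the DISTINCT command keeps indexes 1, 4 and 7 of the anchor_id array, it must also keep indexes 1, 4 and 7 of the id array (and of course remove all the other indexes).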
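The "keep the same indexes in both arrays" requirement, spelled out in Ruby for the case where the two arrays come back from the query as they are now. Again just a sketch: the helper name dedupe_parallel is hypothetical, and both arrays are assumed to have the same length.

```ruby
# Hypothetical helper: de-duplicate anchor_ids while keeping ids aligned.
# Returns [anchor_ids, ids] with the same positions kept in both.
def dedupe_parallel(anchor_ids, ids, random: Random.new)
  kept = anchor_ids.zip(ids)                          # glue the arrays together
                   .shuffle(random: random)           # random order decides which duplicate survives
                   .uniq { |anchor_id, _id| anchor_id } # uniq keeps the first pair per anchor_id
  kept.empty? ? [[], []] : kept.transpose             # split back into two arrays
end

anchor_ids, ids = dedupe_parallel([4, 6, 4, 8, 4], [1, 3, 2, 5, 4])
```

Zipping before the de-duplication is what keeps the two arrays in sync, since a pair is either kept or dropped as a whole.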