5

相关的,前面的问题:
在按值(而不是列)分组后从组中选择一个随机条目?

我当前的查询如下所示:

WITH
  points AS (
    SELECT unnest(array_of_points) AS p
  ),

 gtps AS (
   SELECT DISTINCT ON(points.p)
     points.p, m.groundtruth
   FROM measurement m, points
   WHERE st_distance(m.groundtruth, points.p) < distance
   ORDER BY points.p, RANDOM()
 )

SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()

语义:

  1. 有两个输入值:

    • 第 4 行:点数组array_of_points
    • 第 12 行:一个双精度数:distance
  2. 第一段(第 1-6 行):

    • 从点数组创建一个表以用于...
  3. 第二段(第 8-14 行):

    • 对于表内的每个点points:从表中获取一个随机(!)groundtruthmeasurement,其距离 <distance
    • 将这些元组保存在gtps表中
  4. 第三段(第 16-19 行):

    • 对于表中groundtruth的每个值gtps:获取所有anchor_id值并...
    • 如果一个anchor_id值不是唯一的:然后随机选择一个
  5. 输出:id, anchor_id, groundtruth, p(来自 的输入值array_of_points

示例表:

id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 8         | POINT(1 4)  | ...
4  | 6         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 6         | POINT(4 3)  | ...
10 | 7         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...
-----------------------------------
12 | 1         | POINT(6 2)  | ...
13 | 5         | POINT(6 2)  | ...

示例结果:

id  | anchor_id | groundtruth | p
-----------------------------------------
1   | 1         | POINT(1 4)  | POINT(1 0)
2   | 3         | POINT(1 4)  | POINT(1 0)
4   | 6         | POINT(1 4)  | POINT(1 0)
3   | 8         | POINT(1 4)  | POINT(1 0)
5   | 2         | POINT(3 2)  | POINT(2 2)
6   | 4         | POINT(3 2)  | POINT(2 2)
1   | 1         | POINT(1 4)  | POINT(4 8)
2   | 3         | POINT(1 4)  | POINT(4 8)
4   | 6         | POINT(1 4)  | POINT(4 8)
3   | 8         | POINT(1 4)  | POINT(4 8)
12  | 1         | POINT(6 2)  | POINT(7 3)
13  | 5         | POINT(6 2)  | POINT(7 3)
1   | 1         | POINT(4 3)  | POINT(9 1)
11  | 3         | POINT(4 3)  | POINT(9 1)
9   | 6         | POINT(4 3)  | POINT(9 1)
10  | 7         | POINT(4 3)  | POINT(9 1)

如你看到的:

  • 每个输入值可以有多个相等的groundtruth值。
  • 如果输入值有多个groundtruth值,则它们必须全部相等。
  • 每个 groundtruth-inputPoint-tuple 都与该 groundtruth 的每个可能情况相连anchor_id
  • 两个不同的输入值可以具有相同的对应groundtruth值。
  • 两个不同的 groundtruth-inputPoint-tuples 可以有相同的anchor_id
  • 两个相同的 groundtruth-inputPoint-tuples 必须有不同anchor_id的 s

基准(两个输入值):

  • 第 1-6 行:16 毫秒
  • 第 8-14 行:48 毫秒
  • 第 16-19 行:600 毫秒

详细解释:

Unique  (cost=11119.32..11348.33 rows=18 width=72)
  Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
          Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160)
          Output: points.p, m.groundtruth, (random())
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160)
                Output: points.p, m.groundtruth, (random())
                Sort Key: points.p, (random())
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160)
                      Output: points.p, m.groundtruth, random()
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32)
                            Output: points.p
                      ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=132)
                            Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72)
        Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72)
              Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64)
                    Output: gtps.p, gtps.groundtruth
              ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=8)
                    Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"

解释分析:

Unique  (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
                Sort Key: points.p, (random())
                Sort Method: external merge  Disk: 1248kB
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
                      ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())"
        Sort Method: external merge  Disk: 4008kB
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
              ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms

实时运行时,它应该运行大约 100-1000 个输入值。所以现在需要 35 到 350 秒,这远远不够。

我已经尝试删除这些RANDOM()功能。这将运行时间(对于 2 个输入值)从大约 670 毫秒减少到大约 530 毫秒。所以这不是目前的主要影响。

如果这样更容易/更快,也可以运行 2 或 3 个单独的查询并在软件中执行某些部分(它在 Ruby on Rails 服务器上运行)。例如随机选择?!

工作正在进行中:

SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p

使用此查询它非常快(15ms),但缺少很多:

  • 我只需要一个随机行ps.p
  • 这两个数组属于彼此。意思是:里面物品的顺序很重要!
  • 这两个数组需要过滤(随机):
    对于anchor_id数组中出现多次的每个数组:保留一个随机数组并删除所有其他数组。这也意味着idid-array 中删除每个已删除的对应项anchor_id

如果anchor_id并且id可以存储在元组数组中,那也很好。例如:({[4,1],[6,3],[4,2],[8,5],[4,4]}约束:每个元组都是唯一的,每个 id(== 示例中的第二个值)都是唯一的,anchor_ids 不是唯一的)。此示例显示没有仍必须应用的过滤器的查询。应用过滤器后,它看起来像这样{[6,3],[4,4],[8,5]}

进行中的工作二:

SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()

这现在给出了非常好的结果,并且仍然非常快:16ms
只剩下一件事要做:

  • ARRAY_AGG(m.anchor_id)已经随机化,但是:
  • 它包含很多重复的条目,所以:
  • 我想在上面使用类似 DISTINCT 的东西,但是:
  • 它必须与ARRAY_AGG(m.id). 这意味着:
    如果 DISTINCT 命令保留数组的索引 1、4 和 7 anchor_id,那么它还必须保留数组的索引 1、4 和 7 id(当然删除所有其他索引)
4

1 回答 1

2

如果可以将 anchor_id 和 id 存储在元组数组中,那就太好了。

多维数组的聚合函数

我想您为此创建了一个二维数组。这比ARRAY of record. 标准array_agg()不能聚合多维数组。但是您可以很容易地为此编写自己的聚合函数:

CREATE AGGREGATE array_agg_mult (anyarray)  (
    SFUNC     = array_cat
   ,STYPE     = anyarray
   ,INITCOND  = '{}'
);

阅读此相关答案中的解释:
Selecting data into a Postgres array

对于数组中出现多次的每个anchor_id:保留一个随机的并删除所有其他的。这也意味着从 id-array 中为每个删除的 anchor_id 删除相应的 id

询问

SELECT DISTINCT ON (p)
       p, groundtruth, array_agg_mult(ARRAY[ARRAY[anchor_id, id]]) AS ids
FROM (
   SELECT DISTINCT ON (ps.p, m.groundtruth, m.anchor_id)
          ps.p, m.groundtruth, m.anchor_id, m.id
   FROM  (SELECT unnest(point_array) AS p) AS ps
   JOIN   measurement m ON ST_DWithin(ps.p, m.groundtruth, distance)
   ORDER  BY ps.p, m.groundtruth, m.anchor_id, random()
   ) x
GROUP  BY p, groundtruth
ORDER  BY p, random();
  • 如果有多个对等点,子查询x会变得不同anchor_id(p, groundtruth)并随机选择一行。这样连接就anchor_id - id保持不变。

  • 外部查询聚合一个二维数组,如您所愿,按anchor_id. 如果您想anchor_id随机订购,请再次使用 random :

    array_agg_mult(ARRAY[ARRAY[anchor_id, id]] ORDER BY random())
    
  • 最后,再次随机DISTINCT ON选择 1groundtruth个。p

于 2013-02-27T02:05:33.843 回答