
I'm working with a forked version of PostgreSQL 8.2 that has been adapted for MPP. I'm trying to compute the greatest lower bound of a series of timestamps from two fairly large tables. Here is a sample of the tables in question:

Table A
|source_ip (inet type)  |s_time (timestamp type)  |
---------------------------------------------------
|10.50.43.200           | 2013-02-21 01:47:08  |
|10.50.43.200           | 2013-02-21 01:47:38  |
|10.50.43.200           | 2013-02-21 01:47:41  |
|10.50.43.200           | 2013-02-25 17:05:00  |
|10.50.43.200           | 2013-02-25 17:05:03  |
|10.50.43.200           | 2013-02-25 17:05:04  |
|10.50.43.200           | 2013-02-25 17:05:34  |
|10.50.43.200           | 2013-02-25 17:10:01  |
|10.50.43.200           | 2013-02-25 17:12:52  |

Table B
|source_ip (inet type)  |mac (macaddr type)   |l_time (timestamp type)  |
-------------------------------------------------------------------------
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 22:33:47  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 23:07:32  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 23:13:04  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-21 00:02:56  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:04:56  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:04:59  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:26:15  |

For each row in Table A, I want to join an additional column that is the "greatest lower bound" among the timestamps in Table B. That is, I want a column containing the greatest time out of all the values in Table B that is still less than or equal to the corresponding time in Table A. The output I expect looks like this:

 OUTPUT
 ------------------------------------------------------------
 |10.50.43.200  |2013-02-21 01:47:38  |2013-02-21 00:02:56  |
 |10.50.43.200  |2013-02-21 01:47:41  |2013-02-21 00:02:56  |
 |10.50.43.200  |2013-02-25 17:05:00  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:03  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:04  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:34  |2013-02-25 17:04:59  |

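In procedural terms this is an "as-of" lookup: for each s_time, find the largest l_time below it. A minimal Python sketch of that semantics against the sample data above (timestamps kept as ISO-format strings, which compare correctly lexicographically; this just illustrates the intended result, not the SQL):

```python
from bisect import bisect_left

# l_time values from Table B for 10.50.43.200, in sorted order.
b_times = [
    "2013-02-20 22:33:47", "2013-02-20 23:07:32", "2013-02-20 23:13:04",
    "2013-02-21 00:02:56", "2013-02-25 17:04:56", "2013-02-25 17:04:59",
    "2013-02-25 17:26:15",
]

def greatest_lower_bound(s_time):
    """Largest l_time strictly less than s_time (mirroring the
    a.s_time > b.l_time join condition), or None if there is none."""
    i = bisect_left(b_times, s_time)  # first index with b_times[i] >= s_time
    return b_times[i - 1] if i > 0 else None

print(greatest_lower_bound("2013-02-21 01:47:38"))  # 2013-02-21 00:02:56
print(greatest_lower_bound("2013-02-25 17:05:00"))  # 2013-02-25 17:04:59
```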
The following query is what I came up with, but I'm not sure whether the max() aggregate is the best way to go about this.

So, my question is: can we rewrite the query below without max() so that it runs faster on large data sets (in the 100+ million row range)?

SELECT a.source_ip,
       a.s_time,
       max(b.l_time) AS max_time
  FROM table_a AS a
  INNER JOIN table_b AS b
    ON (a.source_ip = b.source_ip AND a.s_time > b.l_time)
 GROUP BY a.source_ip, a.s_time
 ORDER BY a.s_time ASC;

Here is the explain plan:

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice1; segments: 72)  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
   ->  HashAggregate  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
         Group By: a.source_ip, a.s_time
         ->  Hash Join  (cost=991681.79..1169135585.55 rows=648222862 width=23)
               Hash Cond: a.source_ip = b.source_ip
               Join Filter: a.s_time > b.l_time
               ->  Append-only Columnar Scan on  a  (cost=0.00..1083707.12 rows=1439149 width=15)
               ->  Hash  (cost=487360.24..487360.24 rows=560358 width=15)
                     ->  Seq Scan on  b  (cost=0.00..487360.24 rows=560358 width=15)
(9 rows)

I know I could hash the source_ips to bigints for a faster join. I also think it might be worth experimenting with indexes on the join columns, but I'm not sure what the best optimization strategy is and would welcome any input from the excellent group of experts in the StackOverflow community. We also tried the rank() window function, but it is problematic in the implementation we use and was the worst performer of this type of query that we tested, so the ideal strategy would hopefully avoid window functions altogether.

Edit: added indexes on source_ip to the tables and rewrote the query with LIMIT 1 as suggested in the answer below:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice2; segments: 72)  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
   ->  HashAggregate  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
         Group By: a.src, a.start_time
         ->  Append-only Columnar Scan on a  (cost=0.00..1098806.16 rows=1459206 width=15)
         SubPlan 1
           ->  Limit  (cost=708374.49..708374.51 rows=1 width=15)
                 ->  Limit  (cost=708374.49..708374.49 rows=1 width=15)
                       ->  Sort  (cost=708374.49..708376.35 rows=11 width=15)
                             Sort Key (Limit): b.source_ip, b.start_time
                             ->  Result  (cost=708339.65..708347.10 rows=11 width=15)
                                   Filter: $0 = b.source_ip AND $1 > b.start_time
                                   ->  Materialize for deadlock safety  (cost=708339.65..708347.10 rows=11 width=15)
                                         ->  Broadcast Motion 72:72  (slice1; segments: 72)  (cost=0.00..708338.90 rows=11 width=15)
                                               ->  Seq Scan on b  (cost=0.00..708338.90 rows=11 width=15)

2 Answers


The standard way of finding the maximum without using MAX() or LIMIT is NOT EXISTS (no record with a larger value), like:

SELECT a.src, a.s_Time
        , b.l_Time AS max_time
    FROM table_a AS a
    JOIN table_b AS b ON b.source_ip = a.source_ip 
                     AND b.l_time < a.s_time
    WHERE NOT EXISTS (
      SELECT *
      FROM table_b nx
      WHERE nx.source_ip = b.source_ip
        AND nx.l_time < a.s_time
        AND nx.l_time > b.l_time
    );
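As a quick sanity check of the NOT EXISTS rewrite, here is the same query run against the sample data in an in-memory SQLite database (a stand-in for your PostgreSQL fork; types are simplified to text, which is enough for the comparison logic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (source_ip TEXT, s_time TEXT);
CREATE TABLE table_b (source_ip TEXT, l_time TEXT);
INSERT INTO table_a VALUES ('10.50.43.200', '2013-02-25 17:05:00');
INSERT INTO table_b VALUES
  ('10.50.43.200', '2013-02-25 17:04:56'),
  ('10.50.43.200', '2013-02-25 17:04:59'),
  ('10.50.43.200', '2013-02-25 17:26:15');
""".replace("INSERT INTO", "INSERT INTO"))
rows = conn.execute("""
SELECT a.source_ip, a.s_time, b.l_time AS max_time
  FROM table_a AS a
  JOIN table_b AS b ON b.source_ip = a.source_ip
                   AND b.l_time < a.s_time
 WHERE NOT EXISTS (
   SELECT * FROM table_b nx
    WHERE nx.source_ip = b.source_ip
      AND nx.l_time < a.s_time
      AND nx.l_time > b.l_time)
""").fetchall()
print(rows)  # [('10.50.43.200', '2013-02-25 17:05:00', '2013-02-25 17:04:59')]
```

The NOT EXISTS clause discards every b row that has a later sibling still below a.s_time, leaving exactly the greatest lower bound.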
answered 2013-05-29T08:50:34.497
SELECT a.src,
    a.s_time,
    (SELECT b.l_time AS max_time 
       FROM table_b AS b WHERE a.source_ip = b.source_ip
         AND a.s_time > b.l_time 
         ORDER BY b.source_ip DESC, b.l_time DESC /* index on (source_ip, l_time) */
         LIMIT 1)
FROM table_a AS a
ORDER BY a.start_time;

I omitted the GROUP BY because I don't see where a.src comes from, and I'm not sure whether a.s_time and a.start_time are different columns.

Anyway, the idea is that PG is quite smart about indexed LIMIT 1 queries (at least, recent versions are; no guarantees for 8.2). Recent versions may even be smart enough to transform a MAX into the equivalent LIMIT 1 query when needed, but I'm fairly sure that came after 8.2.
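The correlated LIMIT 1 approach can be checked the same way; again SQLite is only a stand-in for your fork, and the (source_ip, l_time) index mirrors the one suggested in the comment above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (source_ip TEXT, s_time TEXT);
CREATE TABLE table_b (source_ip TEXT, l_time TEXT);
CREATE INDEX b_ip_time ON table_b (source_ip, l_time);
INSERT INTO table_a VALUES ('10.50.43.200', '2013-02-21 01:47:38');
INSERT INTO table_b VALUES
  ('10.50.43.200', '2013-02-20 23:13:04'),
  ('10.50.43.200', '2013-02-21 00:02:56');
""")
rows = conn.execute("""
SELECT a.source_ip, a.s_time,
       (SELECT b.l_time FROM table_b AS b
         WHERE b.source_ip = a.source_ip
           AND b.l_time < a.s_time
         ORDER BY b.l_time DESC
         LIMIT 1) AS max_time
  FROM table_a AS a
""").fetchall()
print(rows)  # [('10.50.43.200', '2013-02-21 01:47:38', '2013-02-21 00:02:56')]
```

With the composite index in place, each subquery should resolve as a single backward index scan per outer row rather than an aggregate over all matches.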

answered 2013-05-29T07:21:40.617