
I'm working with a forked version of PostgreSQL 8.2 that has been adapted for MPP. I'm trying to compute the greatest lower bound of a series of timestamps from two fairly large tables. Here is a sample of the tables in question:

Table A
|source_ip (inet type)  |s_time (timestamp type)  |
---------------------------------------------------
|10.50.43.200           | 2013-02-21 01:47:08  |
|10.50.43.200           | 2013-02-21 01:47:38  |
|10.50.43.200           | 2013-02-21 01:47:41  |
|10.50.43.200           | 2013-02-25 17:05:00  |
|10.50.43.200           | 2013-02-25 17:05:03  |
|10.50.43.200           | 2013-02-25 17:05:04  |
|10.50.43.200           | 2013-02-25 17:05:34  |
|10.50.43.200           | 2013-02-25 17:10:01  |
|10.50.43.200           | 2013-02-25 17:12:52  |

Table B
|source_ip (inet type)  |mac (macaddr type)   |l_time (timestamp type)  |
-------------------------------------------------------------------------
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 22:33:47  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 23:07:32  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-20 23:13:04  |
|10.50.43.200           | 00:24:d7:99:e9:0c   | 2013-02-21 00:02:56  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:04:56  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:04:59  |
|10.50.43.200           | 00:24:d7:99:68:14   | 2013-02-25 17:26:15  |

For each row in Table A, I want to join an additional column that is the "greatest lower bound" among the timestamps in Table B. That is, I want a column containing the greatest time out of all the values in Table B that is still less than or equal to the corresponding time in Table A. The output I expect looks like this:

 OUTPUT
 ------------------------------------------------------------
 |10.50.43.200  |2013-02-21 01:47:38  |2013-02-21 00:02:56  |
 |10.50.43.200  |2013-02-21 01:47:41  |2013-02-21 00:02:56  |
 |10.50.43.200  |2013-02-25 17:05:00  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:03  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:04  |2013-02-25 17:04:59  |
 |10.50.43.200  |2013-02-25 17:05:34  |2013-02-25 17:04:59  |

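In procedural terms this is an "as-of" lookup: for each s_time, find the largest l_time below it. A minimal Python sketch of that semantics against the sample data above (timestamps kept as ISO-format strings, which compare correctly lexicographically; this just illustrates the intended result, not the SQL):

```python
from bisect import bisect_left

# l_time values from Table B for 10.50.43.200, in sorted order.
b_times = [
    "2013-02-20 22:33:47", "2013-02-20 23:07:32", "2013-02-20 23:13:04",
    "2013-02-21 00:02:56", "2013-02-25 17:04:56", "2013-02-25 17:04:59",
    "2013-02-25 17:26:15",
]

def greatest_lower_bound(s_time):
    """Largest l_time strictly less than s_time (mirroring the
    a.s_time > b.l_time join condition), or None if there is none."""
    i = bisect_left(b_times, s_time)  # first index with b_times[i] >= s_time
    return b_times[i - 1] if i > 0 else None

print(greatest_lower_bound("2013-02-21 01:47:38"))  # 2013-02-21 00:02:56
print(greatest_lower_bound("2013-02-25 17:05:00"))  # 2013-02-25 17:04:59
```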
The following query is what I came up with, but I'm not sure whether the max() aggregate is the best way to go about this.

So, my question is: can we rewrite the query below without max() so that it runs faster on large data sets (in the 100+ million row range)?

SELECT a.source_ip,
       a.s_time,
       max(b.l_time) AS max_time
  FROM table_a AS a
  INNER JOIN table_b AS b
    ON (a.source_ip = b.source_ip AND a.s_time > b.l_time)
 GROUP BY a.source_ip, a.s_time
 ORDER BY a.s_time ASC;

Here is the explain plan:

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice1; segments: 72)  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
   ->  HashAggregate  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
         Group By: a.source_ip, a.s_time
         ->  Hash Join  (cost=991681.79..1169135585.55 rows=648222862 width=23)
               Hash Cond: a.source_ip = b.source_ip
               Join Filter: a.s_time > b.l_time
               ->  Append-only Columnar Scan on  a  (cost=0.00..1083707.12 rows=1439149 width=15)
               ->  Hash  (cost=487360.24..487360.24 rows=560358 width=15)
                     ->  Seq Scan on  b  (cost=0.00..487360.24 rows=560358 width=15)
(9 rows)

I know I could hash the source_ips to bigints for a faster join. I also think it might be worth experimenting with indexes on the join columns, but I'm not sure what the best optimization strategy is and would welcome any input from the excellent group of experts in the StackOverflow community. We also tried the rank() window function, but it is problematic in the implementation we use and was the worst performer of this type of query that we tested, so the ideal strategy would hopefully avoid window functions altogether.

Edit: added indexes on source_ip to the tables and rewrote the query with LIMIT 1 as suggested in the answer below:

                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice2; segments: 72)  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
   ->  HashAggregate  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
         Group By: a.src, a.start_time
         ->  Append-only Columnar Scan on a  (cost=0.00..1098806.16 rows=1459206 width=15)
         SubPlan 1
           ->  Limit  (cost=708374.49..708374.51 rows=1 width=15)
                 ->  Limit  (cost=708374.49..708374.49 rows=1 width=15)
                       ->  Sort  (cost=708374.49..708376.35 rows=11 width=15)
                             Sort Key (Limit): b.source_ip, b.start_time
                             ->  Result  (cost=708339.65..708347.10 rows=11 width=15)
                                   Filter: $0 = b.source_ip AND $1 > b.start_time
                                   ->  Materialize for deadlock safety  (cost=708339.65..708347.10 rows=11 width=15)
                                         ->  Broadcast Motion 72:72  (slice1; segments: 72)  (cost=0.00..708338.90 rows=11 width=15)
                                               ->  Seq Scan on b  (cost=0.00..708338.90 rows=11 width=15)

2 Answers


The standard way of finding the maximum without using MAX() or LIMIT is NOT EXISTS (no record with a larger value), like:

SELECT a.src, a.s_Time
        , b.l_Time AS max_time
    FROM table_a AS a
    JOIN table_b AS b ON b.source_ip = a.source_ip 
                     AND b.l_time < a.s_time
    WHERE NOT EXISTS (
      SELECT *
      FROM table_b nx
      WHERE nx.source_ip = b.source_ip
        AND nx.l_time < a.s_time
        AND nx.l_time > b.l_time
    );
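As a quick sanity check of the NOT EXISTS rewrite, here is the same query run against the sample data in an in-memory SQLite database (a stand-in for your PostgreSQL fork; types are simplified to text, which is enough for the comparison logic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (source_ip TEXT, s_time TEXT);
CREATE TABLE table_b (source_ip TEXT, l_time TEXT);
INSERT INTO table_a VALUES ('10.50.43.200', '2013-02-25 17:05:00');
INSERT INTO table_b VALUES
  ('10.50.43.200', '2013-02-25 17:04:56'),
  ('10.50.43.200', '2013-02-25 17:04:59'),
  ('10.50.43.200', '2013-02-25 17:26:15');
""".replace("INSERT INTO", "INSERT INTO"))
rows = conn.execute("""
SELECT a.source_ip, a.s_time, b.l_time AS max_time
  FROM table_a AS a
  JOIN table_b AS b ON b.source_ip = a.source_ip
                   AND b.l_time < a.s_time
 WHERE NOT EXISTS (
   SELECT * FROM table_b nx
    WHERE nx.source_ip = b.source_ip
      AND nx.l_time < a.s_time
      AND nx.l_time > b.l_time)
""").fetchall()
print(rows)  # [('10.50.43.200', '2013-02-25 17:05:00', '2013-02-25 17:04:59')]
```

The NOT EXISTS clause discards every b row that has a later sibling still below a.s_time, leaving exactly the greatest lower bound.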
answered 2013-05-29T08:50:34.497
SELECT a.src,
    a.s_time,
    (SELECT b.l_time AS max_time 
       FROM table_b AS b WHERE a.source_ip = b.source_ip
         AND a.s_time > b.l_time 
         ORDER BY b.source_ip DESC, b.l_time DESC /* index on (source_ip, l_time) */
         LIMIT 1)
FROM table_a AS a
ORDER BY a.start_time;

I omitted the GROUP BY because I don't see where a.src comes from, and I'm not sure whether a.s_time and a.start_time are different columns.

Anyway, the idea is that PG is quite smart about indexed LIMIT 1 queries (at least, recent versions are; no guarantees for 8.2). Recent versions may even be smart enough to transform a MAX into the equivalent LIMIT 1 query when needed, but I'm fairly sure that came after 8.2.
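The correlated LIMIT 1 approach can be checked the same way; again SQLite is only a stand-in for your fork, and the (source_ip, l_time) index mirrors the one suggested in the comment above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (source_ip TEXT, s_time TEXT);
CREATE TABLE table_b (source_ip TEXT, l_time TEXT);
CREATE INDEX b_ip_time ON table_b (source_ip, l_time);
INSERT INTO table_a VALUES ('10.50.43.200', '2013-02-21 01:47:38');
INSERT INTO table_b VALUES
  ('10.50.43.200', '2013-02-20 23:13:04'),
  ('10.50.43.200', '2013-02-21 00:02:56');
""")
rows = conn.execute("""
SELECT a.source_ip, a.s_time,
       (SELECT b.l_time FROM table_b AS b
         WHERE b.source_ip = a.source_ip
           AND b.l_time < a.s_time
         ORDER BY b.l_time DESC
         LIMIT 1) AS max_time
  FROM table_a AS a
""").fetchall()
print(rows)  # [('10.50.43.200', '2013-02-21 01:47:38', '2013-02-21 00:02:56')]
```

With the composite index in place, each subquery should resolve as a single backward index scan per outer row rather than an aggregate over all matches.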

answered 2013-05-29T07:21:40.617