I am working with a forked version of PostgreSQL 8.2 that has been adapted for MPP. I am trying to compute the greatest lower bound of a series of timestamps from two fairly large tables. Here is a sample of the tables in question:
Table A
|source_ip (inet type) |s_time (timestamp type) |
------------------------------------------------
|10.50.43.200 | 2013-02-21 01:47:08 |
|10.50.43.200 | 2013-02-21 01:47:38 |
|10.50.43.200 | 2013-02-21 01:47:41 |
|10.50.43.200 | 2013-02-25 17:05:00 |
|10.50.43.200 | 2013-02-25 17:05:03 |
|10.50.43.200 | 2013-02-25 17:05:04 |
|10.50.43.200 | 2013-02-25 17:05:34 |
|10.50.43.200 | 2013-02-25 17:10:01 |
|10.50.43.200 | 2013-02-25 17:12:52 |
Table B
|source_ip (inet type) |mac (macaddr type) |l_time (timestamp type) |
----------------------------------------------------------------------
|10.50.43.200 | 00:24:d7:99:e9:0c | 2013-02-20 22:33:47 |
|10.50.43.200 | 00:24:d7:99:e9:0c | 2013-02-20 23:07:32 |
|10.50.43.200 | 00:24:d7:99:e9:0c | 2013-02-20 23:13:04 |
|10.50.43.200 | 00:24:d7:99:e9:0c | 2013-02-21 00:02:56 |
|10.50.43.200 | 00:24:d7:99:68:14 | 2013-02-25 17:04:56 |
|10.50.43.200 | 00:24:d7:99:68:14 | 2013-02-25 17:04:59 |
|10.50.43.200 | 00:24:d7:99:68:14 | 2013-02-25 17:26:15 |
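For concreteness, the assumed DDL is roughly the following (a sketch reconstructed from the sample data above; the real tables are in the 100+ million row range):

CREATE TABLE table_a (
    source_ip inet,
    s_time    timestamp
);

CREATE TABLE table_b (
    source_ip inet,
    mac       macaddr,
    l_time    timestamp
);
-- On the MPP fork these would presumably also carry a distribution clause,
-- e.g. DISTRIBUTED BY (source_ip), so rows with the same IP co-locate.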
For each row in table A, I want to join an additional column that is the "greatest lower bound" of that row's timestamp with respect to table B. That is, I want a column containing the greatest time among all the values in table B that is also less than or equal to the corresponding time in table A. The output I expect looks like this:
OUTPUT
|source_ip    |s_time              |max_time            |
------------------------------------------------------------
|10.50.43.200 |2013-02-21 01:47:38 |2013-02-21 00:02:56 |
|10.50.43.200 |2013-02-21 01:47:41 |2013-02-21 00:02:56 |
|10.50.43.200 |2013-02-25 17:05:00 |2013-02-25 17:04:59 |
|10.50.43.200 |2013-02-25 17:05:03 |2013-02-25 17:04:59 |
|10.50.43.200 |2013-02-25 17:05:04 |2013-02-25 17:04:59 |
|10.50.43.200 |2013-02-25 17:05:34 |2013-02-25 17:04:59 |
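Spelled out directly from that definition, the requirement is a correlated subquery like this (a minimal sketch of the semantics only, using <= as described above and not tuned for performance; unlike the inner join below, rows with no lower bound come back with a NULL max_time):

SELECT a.source_ip,
       a.s_time,
       (SELECT max(b.l_time)
          FROM table_b AS b
         WHERE b.source_ip = a.source_ip
           AND b.l_time <= a.s_time) AS max_time
FROM table_a AS a
ORDER BY a.s_time;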
The following join-based query is what I actually came up with, but I am not sure that using the max() aggregate function is the best way to accomplish this. So my question is: can the query below be rewritten without max() so that it runs faster on large data sets (in the 100+ million row range)?
SELECT a.source_ip,
       a.s_time, max(b.l_time) AS max_time
FROM table_a AS a
INNER JOIN table_b AS b
        ON (a.source_ip = b.source_ip AND a.s_time > b.l_time)
GROUP BY a.source_ip, a.s_time
ORDER BY a.s_time ASC;
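For reference, one max()-free shape is DISTINCT ON, which keeps the first joined row per (source_ip, s_time) pair after sorting l_time descending (a sketch; DISTINCT ON is ordinary PostgreSQL 8.2 syntax, but whether the MPP fork executes it efficiently is an assumption to verify):

SELECT DISTINCT ON (a.source_ip, a.s_time)
       a.source_ip, a.s_time, b.l_time AS max_time
FROM table_a AS a
INNER JOIN table_b AS b
        ON (a.source_ip = b.source_ip AND a.s_time > b.l_time)
ORDER BY a.source_ip, a.s_time, b.l_time DESC;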
Here is the explain plan for the max() query:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice1; segments: 72)  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
   ->  HashAggregate  (cost=1519175930.51..1519305453.91 rows=143915 width=48)
         Group By: a.source_ip, a.s_time
         ->  Hash Join  (cost=991681.79..1169135585.55 rows=648222862 width=23)
               Hash Cond: a.source_ip = b.source_ip
               Join Filter: a.s_time > b.l_time
               ->  Append-only Columnar Scan on a  (cost=0.00..1083707.12 rows=1439149 width=15)
               ->  Hash  (cost=487360.24..487360.24 rows=560358 width=15)
                     ->  Seq Scan on b  (cost=0.00..487360.24 rows=560358 width=15)
(9 rows)
I know I could hash the source_ips to bigints for a faster join. I also think it might be worth experimenting with an index on the columns used in the join, but I am not sure what the best optimization strategy is, and I would love any input from the excellent panel of experts in the StackOverflow community. We also tried the rank() window function, but it is problematic in the implementation we use and was the worst performer of this type of query that we tested, so the ideal strategy would hopefully avoid window functions altogether.
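To illustrate the bigint idea, an IPv4 inet can be packed into a bigint join key by recombining its octets (a sketch using only functions present in PostgreSQL 8.2; ip_key is a hypothetical name, and it assumes IPv4 addresses only):

-- host() strips any netmask and returns text; split_part() extracts each octet.
SELECT (split_part(host(source_ip), '.', 1)::bigint << 24)
     + (split_part(host(source_ip), '.', 2)::bigint << 16)
     + (split_part(host(source_ip), '.', 3)::bigint << 8)
     +  split_part(host(source_ip), '.', 4)::bigint AS ip_key
FROM table_a;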
EDIT: Added an index on source_ip and start_time to table A, and rewrote the query using the LIMIT 1 suggestion from the posts.
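The rewritten query is not reproduced here; judging from the plan below, it presumably took roughly this shape (a sketch: the column names src and start_time come from the plan, the strict inequality mirrors its $1 > b.start_time filter, and the ORDER BY direction is a guess following the plan's sort key):

SELECT a.src,
       a.start_time,
       (SELECT b.start_time
          FROM table_b AS b
         WHERE b.source_ip = a.src
           AND b.start_time < a.start_time
         ORDER BY b.source_ip DESC, b.start_time DESC
         LIMIT 1) AS max_time
FROM table_a AS a
GROUP BY a.src, a.start_time;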
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 72:1  (slice2; segments: 72)  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
   ->  HashAggregate  (cost=1624120.24..7442384075819.75 rows=145921 width=48)
         Group By: a.src, a.start_time
         ->  Append-only Columnar Scan on a  (cost=0.00..1098806.16 rows=1459206 width=15)
         SubPlan 1
           ->  Limit  (cost=708374.49..708374.51 rows=1 width=15)
                 ->  Limit  (cost=708374.49..708374.49 rows=1 width=15)
                       ->  Sort  (cost=708374.49..708376.35 rows=11 width=15)
                             Sort Key (Limit): b.source_ip, b.start_time
                             ->  Result  (cost=708339.65..708347.10 rows=11 width=15)
                                   Filter: $0 = b.source_ip AND $1 > b.start_time
                                   ->  Materialize for deadlock safety  (cost=708339.65..708347.10 rows=11 width=15)
                                         ->  Broadcast Motion 72:72  (slice1; segments: 72)  (cost=0.00..708338.90 rows=11 width=15)
                                               ->  Seq Scan on b  (cost=0.00..708338.90 rows=11 width=15)