sql - Efficient GROUP BY expressions in Amazon Redshift/PostgreSQL

Question

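In analytics processing there is often a need to collapse "unimportant" groups of data into a single row of the result table. One way to do this is to GROUP BY a CASE expression that returns the same value (for example, NULL) for every unimportant group, so that those groups are merged into one row. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and is close to PostgreSQL 8.0 in terms of functionality.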

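As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to aggregate so that one row is emitted for every (type, url) pair whose visit count is at least a given threshold, and one (type, NULL) row is emitted per type for all (type, url) pairs whose visit count is below that threshold. The remaining columns of the result table would hold SUM/COUNT aggregates based on this grouping.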

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
|  A   | http://popular.com   |                       |
|  A   | http://popular.com   |                       |
|  A   | < 9997 more times>   |                       |
|  A   | http://popular.com   |                       |
|  A   | http://small-one.com |                       |
|  B   | http://tiny.com      |                       |
|  B   | http://tiny-too.com  |                       |
+------+----------------------+-----------------------+

should produce the following result table with a threshold of 10,000

+------+----------------------+-------------+--------------------------+
| type | url                  | visit_count | < SUM/COUNT aggregates > |
+------+----------------------+-------------+--------------------------+
|  A   | http://popular.com   |       10000 |                          |
|  A   |                      |           1 |                          |
|  B   |                      |           2 |                          |
+------+----------------------+-------------+--------------------------+

Summary:

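Amazon Redshift has certain limitations around correlated subqueries that need to be handled with care. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using a double aggregation, repeating the expression in both the result column list and the outer GROUP BY clause.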

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select type, (case when cnt >= 10000 then url end) as url, sum(cnt) as cnt
from temp_counts
group by type, (case when cnt >= 10000 then url end)

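Further testing showed that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one for each independent CASE expression. In this particular case, on a sample data set of roughly 200M rows, this approach consistently ran about 30% faster. That result, however, is specific to the schema and the data.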

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select * from temp_counts WHERE cnt >= 10000
UNION ALL
SELECT type, NULL as url, SUM(cnt) as cnt from temp_counts 
WHERE cnt < 10000 
GROUP BY type

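This gives two general patterns for implementing and optimizing arbitrary disjoint grouping and summarization in Amazon Redshift. If performance matters to you, benchmark both.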
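Before benchmarking on real data, it may help to check the logic on a tiny table. The following is a minimal sketch, not part of the original question: the table name t_demo, the column types, and the threshold lowered from 10,000 to 3 are all assumptions made for illustration. Only the type and url columns come from the example above.

```sql
-- Hypothetical scaled-down table; only the two grouping columns are modeled.
CREATE TABLE t_demo (type VARCHAR(8), url VARCHAR(256));

-- Same shape as the example data, with popular.com visited 3 times
-- instead of 10,000.
INSERT INTO t_demo (type, url) VALUES
  ('A', 'http://popular.com'),
  ('A', 'http://popular.com'),
  ('A', 'http://popular.com'),
  ('A', 'http://small-one.com'),
  ('B', 'http://tiny.com'),
  ('B', 'http://tiny-too.com');

-- Double aggregation with the threshold lowered to 3 (assumption).
WITH temp_counts AS (
  SELECT type, url, COUNT(*) AS cnt
  FROM t_demo
  GROUP BY type, url
)
SELECT type,
       (CASE WHEN cnt >= 3 THEN url END) AS url,
       SUM(cnt) AS visit_count
FROM temp_counts
GROUP BY type, (CASE WHEN cnt >= 3 THEN url END)
ORDER BY type, visit_count DESC;
```

With this data the query should return three rows: (A, http://popular.com, 3), (A, NULL, 1), and (B, NULL, 2), which mirrors the result table in the question. Swapping in the UNION ALL form against the same table should give the same three rows, possibly in a different order.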

score 3 · Accepted Answer

You can do this with two aggregations:

select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url
     ) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;

score 1 · Answer

First, you group by type, url.
Then you group a second time by type, case when visit_count < 10000 then NULL else url end.

I have used SQL Server syntax; I hope it works in Postgres as well.
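The two steps in this answer can be sketched as a single statement. This is one possible reading of the answer rather than its author's own code: the table name t and the 10,000 threshold are taken from the question, while the subquery alias per_url and the output column name total_visits are hypothetical names chosen here.

```sql
SELECT type,
       -- Step 2: pairs below the threshold collapse to one NULL url per type.
       CASE WHEN visit_count < 10000 THEN NULL ELSE url END AS url,
       SUM(visit_count) AS total_visits
FROM (
  -- Step 1: one row per (type, url) with its visit count.
  SELECT type, url, COUNT(*) AS visit_count
  FROM t
  GROUP BY type, url
) per_url
GROUP BY type,
         CASE WHEN visit_count < 10000 THEN NULL ELSE url END;
```

Writing the condition as visit_count < 10000 keeps a URL with exactly 10,000 visits as its own row, which matches the expected result table in the question.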
