sql - How do I query the percentage change over time

Question

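So I need to use PostgreSQL to query the percentage change in COUNT(DISTINCT userid) between each day and 7 days earlier.

Is this even possible?

Getting the distinct users per day is pretty simple: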

SELECT COUNT(DISTINCT userid), timestamp::date 
FROM logs
GROUP BY timestamp::date
ORDER BY timestamp::date DESC

How do I turn that into a percentage change from 7 days ago to today?

score 3 · Accepted Answer

So we need to take one value for day X and a second value for day X-7, then calculate the percentage.
The query might look like this:

SELECT a.timestamp, 
       a.cnt, 
       b.cnt cnt_minus_7_day, 
       round( 100.0 *( a.cnt - b.cnt ) / b.cnt , 2 ) change_7_days
from (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
) a
left join (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
) b
ON b.timestamp = a.timestamp - 7   -- b is the row 7 days before a
;

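You can also try another version. It should be faster,
because PostgreSQL apparently is not smart enough here and evaluates the same subquery twice
instead of materializing the result in memory or in a temporary table.
A WITH clause helps to avoid this (compare the plans below).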

with src as (
    SELECT timestamp::date, COUNT(DISTINCT userid)  cnt
    FROM logs
    GROUP BY timestamp::date
    ORDER BY timestamp::date 
)
SELECT a.timestamp, 
       a.cnt, 
       b.cnt cnt_minus_7_day, 
       round( 100.0 *( a.cnt - b.cnt ) / b.cnt , 2 ) change_7_days
FROM src a
left join src b
on b.timestamp = a.timestamp - 7   -- b is the row 7 days before a

Here is the plan for the first query (run on my sample data):

"Hash Left Join  (cost=5136.71..5350.93 rows=101 width=20) (actual time=77.778..88.676 rows=101 loops=1)"
"  Hash Cond: (public.logs."timestamp" = (b."timestamp" - 7))"
"  ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=44.398..55.129 rows=101 loops=1)"
"        ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=44.290..48.392 rows=27889 loops=1)"
"              Sort Key: public.logs."timestamp""
"              Sort Method: external merge  Disk: 488kB"
"              ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.037..10.396 rows=27889 loops=1)"
"  ->  Hash  (cost=2673.32..2673.32 rows=101 width=12) (actual time=33.355..33.355 rows=101 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
"        ->  Subquery Scan on b  (cost=2462.13..2673.32 rows=101 width=12) (actual time=22.883..33.306 rows=101 loops=1)"
"              ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=22.881..33.288 rows=101 loops=1)"
"                    ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=22.817..26.507 rows=27889 loops=1)"
"                          Sort Key: public.logs."timestamp""
"                          Sort Method: external merge  Disk: 488kB"
"                          ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.014..3.696 rows=27889 loops=1)"
"Total runtime: 100.360 ms"

And for the second version:

"Hash Left Join  (cost=2675.59..2680.64 rows=101 width=20) (actual time=60.612..60.785 rows=101 loops=1)"
"  Hash Cond: (a."timestamp" = (b."timestamp" - 7))"
"  CTE src"
"    ->  GroupAggregate  (cost=2462.13..2672.31 rows=101 width=8) (actual time=46.498..60.425 rows=101 loops=1)"
"          ->  Sort  (cost=2462.13..2531.85 rows=27889 width=8) (actual time=46.382..51.113 rows=27889 loops=1)"
"                Sort Key: logs."timestamp""
"                Sort Method: external merge  Disk: 488kB"
"                ->  Seq Scan on logs  (cost=0.00..402.89 rows=27889 width=8) (actual time=0.037..8.945 rows=27889 loops=1)"
"  ->  CTE Scan on src a  (cost=0.00..2.02 rows=101 width=12) (actual time=46.504..46.518 rows=101 loops=1)"
"  ->  Hash  (cost=2.02..2.02 rows=101 width=12) (actual time=14.084..14.084 rows=101 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
"        ->  CTE Scan on src b  (cost=0.00..2.02 rows=101 width=12) (actual time=0.002..14.033 rows=101 loops=1)"
"Total runtime: 67.799 ms"

score 3 · Accepted Answer

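You don't actually need subqueries or a CTE. You can do it in a single SELECT with the window function lag():

I use ts as the column name instead of timestamp, because it is unwise to use reserved words (SQL standard) or Postgres function and type names as identifiers.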

SELECT ts::date
      ,     ((count(DISTINCT userid) * 10000)
        / lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date))::real
        / 100 - 100 AS pct_change_since_7_days_ago
      ,count(DISTINCT userid) AS ct
      ,lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) AS ct_7_days_ago
FROM   logs
GROUP  BY 1
ORDER  BY 1 DESC;

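I arranged the computation of the percentage for performance. This way we get by without round(), which would also require a cast to numeric.
Window functions can be applied to aggregate functions on the same query level, which is why lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) works. The window functions lead() and lag() take additional parameters. Here the offset 7 picks the row 7 rows before the current one.
Note: this requires at least one row for every day, otherwise the calculation is wrong. If gaps are possible, I would go with @kordirko's second query, just without the ORDER BY in the CTE, which should be applied to the outer query instead.
Or you can build a list of days with generate_series() and LEFT JOIN to it. As demonstrated here:
Retrieve row count and return 0 when there are no rows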
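
The generate_series() route could look roughly like the following. This is a minimal sketch, not the code from the linked answer: the column names ts and userid follow the query above, the CTE names days and daily are made up for illustration, and the leading logs CTE is stand-in sample data so the statement runs on its own (drop it to query a real table).

```sql
WITH logs(ts, userid) AS (           -- stand-in sample data (assumption)
    VALUES (timestamp '2024-01-01 10:00', 1)
         , ('2024-01-01 12:00', 2)
         , ('2024-01-08 09:00', 1)   -- no rows at all for Jan 2 through Jan 7
)
, days AS (                          -- one row per calendar day, gaps included
    SELECT d::date AS day
    FROM   generate_series((SELECT min(ts)::date::timestamp FROM logs)
                         , (SELECT max(ts)::date::timestamp FROM logs)
                         , interval '1 day') AS d
)
, daily AS (                         -- days without log rows get a count of 0
    SELECT d.day, count(DISTINCT l.userid) AS ct
    FROM   days d
    LEFT   JOIN logs l ON l.ts::date = d.day
    GROUP  BY d.day
)
SELECT day
     , ct
     , lag(ct, 7) OVER w AS ct_7_days_ago
       -- NULLIF guards against the zero counts that gap days now produce
     , round(100.0 * (ct - lag(ct, 7) OVER w)
             / NULLIF(lag(ct, 7) OVER w, 0), 2) AS pct_change_since_7_days_ago
FROM   daily
WINDOW w AS (ORDER BY day)
ORDER  BY day DESC;
```

Because every day now has exactly one row, lag(ct, 7) really does refer to the same weekday one week earlier. The trade-off is that empty days report a count of 0 instead of being absent, hence the NULLIF().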

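Division by 0

If no row exists for "7 days ago", the result is NULL, both for the LEFT JOIN in @kordirko's version and for lag(). That represents reality nicely ("count unknown") and also serves as automatic protection against division by 0.

However, if userid can be NULL, division by 0 becomes possible, and we need to catch that case. Why the paradoxical effect?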

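Unlike other aggregate functions, count() never returns NULL. Instead, NULL values are simply not counted.
However, if no row is found for "7 days ago", we get NULL for the count, because the whole expression is NULL: count() is not even executed. In this case, that just happens to suit us.
But if one or more rows are found, all of them with userid IS NULL, we get a count of 0, which would raise a division-by-zero exception.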
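
The two cases can be checked in isolation. A small sketch with inline VALUES lists standing in for real data (the aliases t, userid and x are arbitrary):

```sql
-- Rows exist, but every userid is NULL: count() skips NULLs and yields 0.
SELECT count(DISTINCT userid) AS ct
FROM  (VALUES (NULL::int), (NULL::int)) AS t(userid);

-- No row 7 positions back: lag() itself yields NULL,
-- so dividing by it gives NULL instead of raising an error.
SELECT x, 100 / lag(x, 7) OVER (ORDER BY x) AS ratio
FROM  (VALUES (1), (2)) AS t(x);
```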

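For a plain count(userid), we could use count(*) instead to prevent this. But that is not possible for count(DISTINCT userid), and it may or may not return the count you are looking for.

Use NULLIF(count(DISTINCT userid), 0) in that case.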
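
Applied to the lag() query, the guard would go around the divisor. A sketch under the same assumptions as before (columns ts and userid); the logs CTE is made-up sample data so the statement is self-contained, and it includes a day whose only row has a NULL userid:

```sql
WITH logs(ts, userid) AS (           -- stand-in sample data (assumption)
    VALUES (timestamp '2024-01-01 10:00', NULL::int)  -- count(DISTINCT userid) = 0
         , ('2024-01-02 10:00', 1)
         , ('2024-01-03 10:00', 1)
         , ('2024-01-04 10:00', 2)
         , ('2024-01-05 10:00', 1)
         , ('2024-01-06 10:00', 3)
         , ('2024-01-07 10:00', 1)
         , ('2024-01-08 10:00', 1)   -- 7 rows after Jan 1
)
SELECT ts::date AS day
     , ((count(DISTINCT userid) * 10000)
        -- NULLIF turns a zero divisor into NULL, so the result is NULL, not an error
        / NULLIF(lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date), 0))::real
        / 100 - 100 AS pct_change_since_7_days_ago
     , count(DISTINCT userid) AS ct
     , lag(count(DISTINCT userid), 7) OVER (ORDER BY ts::date) AS ct_7_days_ago
FROM   logs
GROUP  BY 1
ORDER  BY 1 DESC;
```

Without the NULLIF(), the row for 2024-01-08 would divide by the 0 counted for 2024-01-01 and the whole statement would fail.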
