Imagine an accounts table that looks like this:

   Column   |            Type             | Modifiers 
------------+-----------------------------+-----------
 id         | bigint                      | not null
 signupdate | timestamp without time zone | not null
 canceldate | timestamp without time zone | 

I'd like to get a report of signups and cancellations by month.

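This is pretty straightforward to do in two queries, one for signups by month and one for cancellations by month. Is there an efficient way to do it in a single query? Some months may have zero signups or cancellations, and those should show up with a zero in the results.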

With source data like this:

id    signupDate     cancelDate
 1    2012-01-13     
 2    2012-01-15     2012-02-05
 3    2012-03-01     2012-03-20

we should get the following results:

Date      signups    cancellations    
2012-01         2                0
2012-02         0                1
2012-03         1                1

I'm using PostgreSQL 9.0.

Update after the first answer:

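Craig Ringer provided a nice answer below. On my data set of approximately 75k records, the first and third examples performed similarly. The second example seems to have an error somewhere; it returned incorrect results.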

Looking at the results of EXPLAIN ANALYZE (my table does have an index on signup_date), the first query returns:

Sort  (cost=2086062.39..2086062.89 rows=200 width=24) (actual time=863.831..863.833 rows=20 loops=1)
  Sort Key: m.m
  Sort Method:  quicksort  Memory: 26kB
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)
          InitPlan 1 (returns $0)
            ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=0.040..0.040 rows=1 loops=1)
                  ->  Index Scan using account_created_idx on account  (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.039..0.039 rows=1 loops=1)
                        Index Cond: (created IS NOT NULL)
  InitPlan 3 (returns $2)
    ->  Aggregate  (cost=2991.39..2991.40 rows=1 width=16) (actual time=37.108..37.108 rows=1 loops=1)
          ->  Seq Scan on account  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.008..14.102 rows=75759 loops=1)
  ->  HashAggregate  (cost=2083057.21..2083063.21 rows=200 width=24) (actual time=863.801..863.806 rows=20 loops=1)
        ->  Nested Loop  (cost=0.00..2077389.49 rows=755696 width=24) (actual time=37.238..805.333 rows=94685 loops=1)
              Join Filter: ((date_trunc('month'::text, a.created) = m.m) OR (date_trunc('month'::text, a.terminateddate) = m.m))
              ->  Function Scan on generate_series m  (cost=0.00..10.00 rows=1000 width=8) (actual time=37.193..37.197 rows=20 loops=1)
              ->  Materialize  (cost=0.00..3361.39 rows=75759 width=16) (actual time=0.004..11.916 rows=75759 loops=20)
                    ->  Seq Scan on account a  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.003..24.019 rows=75759 loops=1)
Total runtime: 872.183 ms

and the third query returns:

Sort  (cost=1199951.68..1199952.18 rows=200 width=8) (actual time=732.354..732.355 rows=20 loops=1)
  Sort Key: m.m
  Sort Method:  quicksort  Memory: 26kB
  InitPlan 4 (returns $2)
    ->  Result  (cost=0.12..0.13 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
          InitPlan 3 (returns $1)
            ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
                  ->  Index Scan using account_created_idx on account  (cost=0.00..8986.92 rows=75759 width=8) (actual time=0.022..0.022 rows=1 loops=1)
                        Index Cond: (created IS NOT NULL)
  InitPlan 5 (returns $3)
    ->  Aggregate  (cost=2991.39..2991.40 rows=1 width=16) (actual time=30.212..30.212 rows=1 loops=1)
          ->  Seq Scan on account  (cost=0.00..2612.59 rows=75759 width=16) (actual time=0.004..8.276 rows=75759 loops=1)
  ->  HashAggregate  (cost=12.50..1196952.50 rows=200 width=8) (actual time=65.226..732.321 rows=20 loops=1)
        ->  Function Scan on generate_series m  (cost=0.00..10.00 rows=1000 width=8) (actual time=30.262..30.264 rows=20 loops=1)
        SubPlan 1
          ->  Aggregate  (cost=2992.34..2992.35 rows=1 width=8) (actual time=21.098..21.098 rows=1 loops=20)
                ->  Seq Scan on account  (cost=0.00..2991.39 rows=379 width=8) (actual time=0.265..20.720 rows=3788 loops=20)
                      Filter: (date_trunc('month'::text, created) = $0)
        SubPlan 2
          ->  Aggregate  (cost=2992.34..2992.35 rows=1 width=8) (actual time=13.994..13.994 rows=1 loops=20)
                ->  Seq Scan on account  (cost=0.00..2991.39 rows=379 width=8) (actual time=2.363..13.887 rows=998 loops=20)
                      Filter: (date_trunc('month'::text, terminateddate) = $0)
Total runtime: 732.487 ms

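This certainly makes the third query look faster, but when I run the queries from the command line using the time command, the first query is consistently faster, though only by a few milliseconds.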

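To my surprise, running two separate queries (one to count signups and one to count cancellations) is significantly faster. It takes less than half the time to run, ~300ms vs ~730ms. Of course that leaves more work to be done externally, but for my purposes it may still be the best solution. Here are the individual queries: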

select 
    m,
    count(a.id) as "signups"
from
    generate_series(
        (SELECT date_trunc('month',min(signup_date)) FROM accounts), 
        (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts), 
        interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m)
group by m
order by m 
;

select 
    m,
    count(a.id) as "cancellations"
from
    generate_series(
        (SELECT date_trunc('month',min(signup_date)) FROM accounts), 
        (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts), 
        interval '1 month') as m
INNER JOIN accounts a ON (date_trunc('month',a.cancel_date) = m)
group by m
order by m 
;

I have marked Craig's answer as correct, but if you can make it faster, I'd love to hear about it.

2 Answers

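Here are three different ways to do it. All of them depend on generating a time series and then scanning it. One uses subqueries to aggregate the data for each month. One joins the table twice against the series with different criteria. The remaining form does a single join on the time series, retaining rows that match either the signup date or the cancel date, and then uses predicates in the counts to filter the results further.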

EXPLAIN ANALYZE will help you pick the approach that works best for your data.
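
For instance, prefixing a candidate query with EXPLAIN ANALYZE executes it and prints the plan that was actually chosen, together with real timings. A minimal sketch, assuming an accounts table with a signup_date column as in the test setup of this answer (the output column names are illustrative):

```sql
-- EXPLAIN ANALYZE runs the statement for real, so the timings reflect actual data.
EXPLAIN ANALYZE
SELECT date_trunc('month', signup_date) AS m, count(*) AS n_signups
FROM accounts
GROUP BY 1
ORDER BY 1;
```

Running the same prefix in front of each of the three variants gives plans that can be compared side by side.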

http://sqlfiddle.com/#!12/99c2a/9

Test setup:

CREATE TABLE accounts
    ("id" int, "signup_date" timestamp, "cancel_date" timestamp);

INSERT INTO accounts
    ("id", "signup_date", "cancel_date")
VALUES
    (1, '2012-01-13 00:00:00', NULL),
    (2, '2012-01-15 00:00:00', '2012-02-05'),
    (3, '2012-03-01 00:00:00', '2012-03-20')
;

With a single join and filtering in the counts:

SELECT m, 
  count(nullif(date_trunc('month',a.signup_date) = m,'f')), 
  count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (date_trunc('month',a.signup_date) = m OR date_trunc('month',a.cancel_date) = m)
GROUP BY m
ORDER BY m;
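
One caveat with the INNER JOIN above: a month in which nobody signed up or cancelled has no matching rows and drops out of the result, whereas the question asks for such months to appear with zeros. A minimal sketch of a variant that keeps them, assuming the same accounts test table (the output names n_signups and n_cancels are chosen here for illustration, and CASE is used in place of nullif purely for readability):

```sql
-- Sketch: LEFT JOIN keeps every generated month, even with no matching account.
-- For an unmatched month all a.* columns are NULL, so both CASE expressions
-- yield NULL and count() reports 0.
SELECT m,
  count(CASE WHEN date_trunc('month', a.signup_date) = m THEN 1 END) AS n_signups,
  count(CASE WHEN date_trunc('month', a.cancel_date) = m THEN 1 END) AS n_cancels
FROM generate_series(
  (SELECT date_trunc('month', min(signup_date)) FROM accounts),
  (SELECT date_trunc('month', greatest(max(signup_date), max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
LEFT JOIN accounts a ON (date_trunc('month', a.signup_date) = m
                      OR date_trunc('month', a.cancel_date) = m)
GROUP BY m
ORDER BY m;
```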

By joining the accounts table twice:

SELECT m, count(s.signup_date) AS n_signups, count(c.cancel_date) AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
LEFT OUTER JOIN accounts s ON (date_trunc('month',s.signup_date) = m)
LEFT OUTER JOIN accounts c ON (date_trunc('month',c.cancel_date) = m)
GROUP BY m
ORDER BY m;
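
Be aware that a double join of this shape pairs every signup in a month with every cancellation in the same month, so the plain counts are inflated whenever a month has matches on both sides and more than one on either. One way around that, sketched here on the assumption that the same accounts table is used, is to aggregate each side down to one row per month before joining (the subquery aliases s and c and the columns mon and n are illustrative names):

```sql
-- Sketch: pre-aggregate signups and cancellations separately, then attach
-- each one-row-per-month result to the generated series. Because each side
-- contributes at most one row per month, the joins cannot multiply rows.
SELECT m,
  coalesce(s.n, 0) AS n_signups,
  coalesce(c.n, 0) AS n_cancels
FROM generate_series(
  (SELECT date_trunc('month', min(signup_date)) FROM accounts),
  (SELECT date_trunc('month', greatest(max(signup_date), max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
LEFT JOIN (
  SELECT date_trunc('month', signup_date) AS mon, count(*) AS n
  FROM accounts
  GROUP BY 1
) s ON s.mon = m
LEFT JOIN (
  SELECT date_trunc('month', cancel_date) AS mon, count(*) AS n
  FROM accounts
  WHERE cancel_date IS NOT NULL   -- skip accounts that never cancelled
  GROUP BY 1
) c ON c.mon = m
ORDER BY m;
```

On the three-row test data this should give the same counts as the table in the question, with months that have no matching rows reported as 0 through coalesce.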

or, using subqueries:

SELECT m, (
  SELECT count(signup_date) 
  FROM accounts 
  WHERE date_trunc('month',signup_date) = m
) AS n_signups, (
  SELECT count(cancel_date)
  FROM accounts
  WHERE date_trunc('month',cancel_date) = m
) AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;
answered 2012-11-20T00:54:49.660

New answer after the update.

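I'm not at all shocked that you get better results from two simpler queries; sometimes it is simply more efficient to do things that way. However, there was an issue with my original answer that will have significantly impacted performance.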

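Erwin accurately pointed out in another answer that Pg can't use a simple b-tree index on a date column when the condition wraps it in date_trunc, so it's better to use ranges. It can use an index created on the expression date_trunc('month',colname), but it's best to avoid creating another unnecessary index.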
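
For reference, an expression index of the kind mentioned above might be declared as follows. This is only a sketch: the index names are made up, and it relies on the columns being timestamp without time zone, for which date_trunc is immutable and therefore allowed in an index expression.

```sql
-- Hypothetical index names; one expression index per truncated column.
CREATE INDEX accounts_signup_month_idx
  ON accounts (date_trunc('month', signup_date));

CREATE INDEX accounts_cancel_month_idx
  ON accounts (date_trunc('month', cancel_date));
```

Such indexes only help queries that use exactly the same expression, which is part of why range conditions on the plain columns are usually the better choice.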

Rewriting the single-scan-and-filter query to use ranges produces:

SELECT m, 
  count(nullif(date_trunc('month',a.signup_date) = m,'f')), 
  count(nullif(date_trunc('month',a.cancel_date) = m,'f'))
FROM generate_series(
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
INNER JOIN accounts a ON (
  (a.signup_date >= m AND a.signup_date < m + INTERVAL '1' MONTH) 
  OR (a.cancel_date >= m AND a.cancel_date < m + INTERVAL '1' MONTH))
GROUP BY m
ORDER BY m;

There is no need to avoid date_trunc in the non-indexable conditions, so I've only changed the join condition to use interval ranges.
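
To see that the half-open range selects the same rows as the date_trunc comparison, the two predicates can be checked side by side on a few boundary values. A small sketch with made-up timestamps (the aliases v, ts and month_start are illustrative):

```sql
-- Compare the date_trunc test with the range test for February 2012.
-- Expected: the two boolean columns agree on every row (f, t, t, f).
SELECT ts,
       date_trunc('month', ts) = m                 AS by_trunc,
       (ts >= m AND ts < m + INTERVAL '1' MONTH)   AS by_range
FROM (VALUES
        (TIMESTAMP '2012-01-31 23:59:59'),
        (TIMESTAMP '2012-02-01 00:00:00'),
        (TIMESTAMP '2012-02-29 23:59:59'),
        (TIMESTAMP '2012-03-01 00:00:00')
     ) AS v(ts)
CROSS JOIN (SELECT TIMESTAMP '2012-02-01 00:00:00' AS m) AS month_start;
```

The range form leaves the column bare on one side of each comparison, which is what lets a plain b-tree index on that column be considered.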

The original query used a seq scan and materialize; this one now uses a bitmap index scan if there are indexes on signup_date and cancel_date.

In PostgreSQL 9.2, better performance may be possible by adding:

CREATE INDEX account_signup_or_cancel ON accounts(signup_date,cancel_date);

and possibly:

CREATE INDEX account_signup_date_nonnull 
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);

CREATE INDEX account_cancel_date_desc_nonnull 
ON accounts(cancel_date DESC) WHERE (cancel_date IS NOT NULL);

to allow index-only scans. It's hard to make solid index recommendations without the actual data to test with.

Alternatively, the subquery-based approach with improved, indexable filter conditions:

SELECT m, (
  SELECT count(signup_date) 
  FROM accounts 
  WHERE signup_date >= m AND signup_date < m + INTERVAL '1' MONTH
) AS n_signups, (
  SELECT count(cancel_date)
  FROM accounts
  WHERE cancel_date >= m AND cancel_date < m + INTERVAL '1' MONTH
) AS n_cancels 
FROM generate_series( 
  (SELECT date_trunc('month',min(signup_date)) FROM accounts),
  (SELECT date_trunc('month',greatest(max(signup_date),max(cancel_date))) FROM accounts),
  INTERVAL '1' MONTH
) AS m
GROUP BY m
ORDER BY m;

will benefit from ordinary b-tree indexes on signup_date and cancel_date, or from:

CREATE INDEX account_signup_date_nonnull 
ON accounts(signup_date) WHERE (signup_date IS NOT NULL);

CREATE INDEX account_cancel_date_nonnull 
ON accounts(cancel_date) WHERE (cancel_date IS NOT NULL);

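Remember that every index you create imposes a penalty on INSERT and UPDATE performance, and will compete with other indexes and with heap data for cache space. Try to create only indexes that make a big difference and are useful for other queries as well.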

answered 2012-11-21T00:03:37.800