sql - Optimizing date queries on large child tables: GiST or GIN?

Question

Problem

There are 72 child tables, each with a year index and a station index, defined as follows:

CREATE TABLE climate.measurement_12_013
(
-- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
-- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
-- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
-- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
-- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
-- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
  CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
  CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
)
INHERITS (climate.measurement);

CREATE INDEX measurement_12_013_s_idx
  ON climate.measurement_12_013
  USING btree
  (station_id);
CREATE INDEX measurement_12_013_y_idx
  ON climate.measurement_12_013
  USING btree
  (date_part('year'::text, taken));

(Foreign key constraints will be added later.)

The following query runs very slowly because of a full table scan:

SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.station_id IN (
    SELECT
      s.id
    FROM
      climate.station s,
      climate.city c
    WHERE
        /* For one city... */
        c.id = 5182 AND

        /* Where stations are within an elevation range... */
        s.elevation BETWEEN 0 AND 3000 AND

        /* and within a specific radius... */
        6371.009 * SQRT( 
          POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
            (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
              POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
        ) <= 50
    ) AND

  /* Data before 1900 is shaky; insufficient after 2009. */
  extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

  /* Whittled down by category... */
  m.category_id = 1 AND

  /* Between the selected days and years... */
  m.taken BETWEEN
   /* Start date. */
   (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
       into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date
GROUP BY
  extract( YEAR FROM m.taken )

The sluggishness comes from this part of the query:

  m.taken BETWEEN
    /* Start date. */
  (extract( YEAR FROM m.taken )||'-01-01')::date AND
    /* End date. Calculated by checking to see if the end date wraps
      into the next year. If it does, then add 1 to the current year.
    */
    (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
      sign(
        (extract( YEAR FROM m.taken )||'-12-31')::date -
        (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
    ) AS text)||'-12-31')::date

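This part of the query matches a selection of days. For example, if the user wants to look at data between June 1 and July 1 over all years for which there is data, the above clause matches only those days. If the user wants to look at data between December 22 and March 22, again for all years for which there is data, the above clause calculates that March 22 falls in the year after December 22, and matches the dates accordingly.

Currently the dates are fixed as January 1 to December 31, but they will be parameterized, as shown above.

The HashAggregate in the plan shows a cost of 10006220141.11, which I suspect is astronomically large.

A full table scan is being performed on the measurement table (which itself has neither data nor indexes). The table aggregates 273 million rows from its child tables.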

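Question

What is the proper way to index the dates to avoid full table scans?

Options I have considered:

GIN
GiST
Rewrite the WHERE clause
Separate year_taken, month_taken, and day_taken columns added to the tables

What are your thoughts?

Thank you!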

score 2 · Accepted Answer

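Your problem is that you have a WHERE clause that depends on a calculation on the date. If the database has to fetch every row and perform a calculation on it before it knows whether the date matches, it cannot use an index.

Unless you rewrite the clause so that the database has a fixed range to check, one that does not depend on the data being retrieved, you will always have to scan the table.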
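As a rough illustration of that point, compare a predicate whose bounds are constants with one whose bounds are computed from each row. This is only a sketch: the index name measurement_12_013_taken_idx is hypothetical (the question defines no plain index on taken), the 1950 dates are placeholders, and whether the planner actually picks the index depends on how selective the range is.

```sql
-- Hypothetical plain btree index on the date column of one child table.
CREATE INDEX measurement_12_013_taken_idx
  ON climate.measurement_12_013
  USING btree
  (taken);

-- Bounds are constants, known before any row is read, so the planner
-- can consider an index range scan on taken.
SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.taken BETWEEN DATE '1950-12-01' AND DATE '1950-12-31';

-- Bounds are derived from m.taken itself, so each row has to be
-- fetched and evaluated before it can be accepted or rejected.
SELECT
  count(1) AS measurements,
  avg(m.amount) AS amount
FROM
  climate.measurement m
WHERE
  m.taken BETWEEN
    (extract( YEAR FROM m.taken )||'-01-01')::date AND
    (extract( YEAR FROM m.taken )||'-12-31')::date;
```

Running EXPLAIN on both statements should show the difference in the chosen plans.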

score 1 · Accepted Answer

Try something like this:

create temporary table test (d date);

insert into test select '1970-01-01'::date+generate_series(1,50*365);

analyze test;

create function month_day(d date) returns int as $$
  select extract(month from $1)::int*100+extract(day from $1)::int $$
language sql immutable strict;

create index test_d_month_day_idx on test (month_day(d));

explain analyze select * from test
  where month_day(d)>=month_day('2000-04-01')
  and month_day(d)<=month_day('2000-04-05');
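
The same expression index can also cover a range that wraps past the end of the year, such as the December 22 to March 22 case from the question. A minimal sketch, assuming the test table and the month_day function defined above; the year 2000 in the literals is arbitrary, since month_day discards it:

```sql
-- A wrapping range is the union of two non-wrapping ones:
-- month_day >= 1222 (Dec 22 onward) OR month_day <= 322 (up to Mar 22).
explain analyze select * from test
  where month_day(d)>=month_day('2000-12-22')
  or month_day(d)<=month_day('2000-03-22');
```

The application would choose between the AND form and the OR form by checking whether the start day comes after the end day.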

score 0 · Accepted Answer

I think to run this efficiently across those partitions I would have your app be a lot smarter about the date ranges. Have it generate an actual list of dates to check per partition, and then have it generate one query with a UNION between the partitions. It sounds like your data set is pretty static, so a CLUSTER on your date index could greatly improve performance as well.
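
A minimal sketch of both suggestions, under loudly stated assumptions: measurement_11_013 is a hypothetical sibling child table (only measurement_12_013 appears in the question), measurement_12_013_d_idx is a hypothetical btree index on taken, and the date ranges stand in for whatever the application would generate. UNION ALL is used here instead of a plain UNION so that identical amounts are not collapsed before aggregation.

```sql
-- Hypothetical date index; CLUSTER needs an index to order the table by.
CREATE INDEX measurement_12_013_d_idx
  ON climate.measurement_12_013
  USING btree
  (taken);

-- Physically reorder the child table by date, then refresh statistics.
-- CLUSTER is a one-time operation that locks the table while it runs.
CLUSTER climate.measurement_12_013 USING measurement_12_013_d_idx;
ANALYZE climate.measurement_12_013;

-- One query with an explicit, constant date range per partition.
SELECT
  count(1) AS measurements,
  avg(u.amount) AS amount
FROM (
  SELECT amount
  FROM climate.measurement_11_013   -- hypothetical child table
  WHERE taken BETWEEN DATE '1950-11-22' AND DATE '1950-11-30'
  UNION ALL
  SELECT amount
  FROM climate.measurement_12_013
  WHERE taken BETWEEN DATE '1950-12-01' AND DATE '1950-12-22'
) u;
```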
