sql - Efficiently querying a huge time series table for one row every 15 minutes

Question

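I have two tables, conttagtable (t) and contfloattable (cf). T has about 43k rows. CF has over 9 billion.

I created an index on the tagindex column of both tables. This column can be thought of as a unique identifier for conttagtable and as a foreign key into conttagtable for contfloattable. I didn't explicitly create a PK or foreign key on either table relating it to the other, although the data is logically related by the tagindex column on both tables, as if conttagtable.tagindex were a PRIMARY KEY and contfloattable.tagindex were a FOREIGN KEY (tagindex) REFERENCES conttagtable(tagindex). The data came from a Microsoft Access dump and I didn't know whether I could trust tagindex to be unique, so "uniqueness" is not enforced.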

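The data itself is extremely large.

I need to obtain a single, arbitrarily selected row from contfloattable for each 15 minute contfloattable.dateandtime interval for each conttagtable.tagid. So, if the contfloattable for a given tagid has 4000 samples spanning 30 minutes, I need one sample from the 0-14 minute range and one sample from the 15-30 minute range. Any one sample within the 15 minute range is acceptable; first, last, random, whatever.

In a nutshell, I need to get a sample every 15 minutes, but only one sample per t.tagname. The samples are currently recorded every 5 seconds and the data spans two years. This is a big data problem and way over my head in terms of sql. All of the time interval solutions I have tried from googling or searching on SO have yielded query times so long that they are not practical.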

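Are my indexes sufficient for a fast join? (They appear to be when leaving out the time interval part.)
Would I benefit from the addition of any other indexes?
What's the best/fastest query that accomplishes the above goals?

Here's a SQLFiddle containing the schema and some sample data: http://sqlfiddle.com/#!1/c7d2f/2

Schema: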

        Table "public.conttagtable" (t)
   Column    |  Type   | Modifiers
-------------+---------+-----------
 tagname     | text    |
 tagindex    | integer |
 tagtype     | integer |
 tagdatatype | integer |
Indexes:
    "tagindex" btree (tagindex)


             Table "public.contfloattable" (CF)
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 dateandtime | timestamp without time zone |
 millitm     | integer                     |
 tagindex    | integer                     |
 Val         | double precision            |
 status      | text                        |
 marker      | text                        |
Indexes:
    "tagindex_contfloat" btree (tagindex)

The output I'd like to see is something like this:

cf.dateandtime      |cf."Val"|cf.status|t.tagname
--------------------------------------------------
2012-11-16 00:00:02  45       S         SuperAlpha
2012-11-16 00:00:02  45       S         SuperBeta
2012-11-16 00:00:02  45       S         SuperGamma
2012-11-16 00:00:02  45       S         SuperDelta
2012-11-16 00:15:02  45       S         SuperAlpha
2012-11-16 00:15:02  45       S         SuperBeta
2012-11-16 00:15:02  45       S         SuperGamma
2012-11-16 00:15:02  45       S         SuperDelta
2012-11-16 00:30:02  45       S         SuperAlpha
2012-11-16 00:30:02  45       S         SuperBeta
2012-11-16 00:30:02  45       S         SuperGamma
2012-11-16 00:30:02  45       S         SuperDelta
2012-11-16 00:45:02  42       S         SuperAlpha

... and so on ...

As suggested by Clodoaldo, this is my latest attempt. Any suggestions to speed it up?

with i as (
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
    contfloattable cf
    inner join
    conttagtable t on cf.tagindex = t.tagindex
    inner join
    i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by floor(extract(epoch from cf.dateandtime) / 60 / 15), cf.tagindex

Query plan for the above: http://explain.depesz.com/s/loR

score 2 · Accepted Answer

For 15 minute intervals:

with i as (
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
    contfloattable cf
    inner join
    conttagtable t on cf.tagindex = t.tagindex
    inner join
    i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname
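
To see what the grouping key in the query above does, here is a small sketch on literal timestamps (the three sample values are made up for illustration, they are not read from the real table). The expression turns each timestamp into a 15 minute bucket number, so the first two rows should land in the same bucket and the third in the next one:

```sql
-- Illustration only: bucket three hand-written timestamps into 15 minute slots.
SELECT s.ts,
       floor(extract(epoch FROM s.ts) / 60 / 15) AS bucket
FROM (VALUES
        (timestamp '2012-11-16 00:00:02'),
        (timestamp '2012-11-16 00:14:57'),
        (timestamp '2012-11-16 00:15:02')
     ) AS s(ts)
ORDER BY s.ts;
```

Taking min(dateandtime) per (bucket, tagindex) then picks the earliest sample in each slot, which satisfies the "any one sample" requirement.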

Show the explain output for this query (if it works) so we can try to optimize it. You can post it in this answer.

Explain output

"Sort  (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
"  Sort Key: cf.dateandtime, t.tagname"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49481978.32 rows=19436288 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Hash Join  (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
"        Hash Cond: (cf.tagindex = t.tagindex)"
"        ->  Merge Join  (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
"              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"              ->  Sort  (cost=2741707.02..2790297.74 rows=19436288 width=12)"
"                    Sort Key: i.tagindex, i.dateandtime"
"                    ->  CTE Scan on i  (cost=0.00..388725.76 rows=19436288 width=12)"
"              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                          Sort Key: cf.tagindex, cf.dateandtime"
"                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

It looks like you need this index:

create index cf_tag_datetime on contfloattable (tagindex, dateandtime)

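Run analyze after creating it. Note that any index on a big table has a significant performance impact on data changes (inserts etc.), because it has to be updated on every change.

Update

I added the cf_tag_datetime index (tagindex, dateandtime), and here's the new explain: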
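
A minimal sketch of that sequence, as a variant of the plain statement above. CONCURRENTLY is optional and is my addition, not part of the original suggestion: it avoids blocking writes while the index builds, but the build is slower and it cannot run inside a transaction block.

```sql
-- Build the composite index without blocking inserts during the build.
CREATE INDEX CONCURRENTLY cf_tag_datetime
    ON contfloattable (tagindex, dateandtime);

-- Refresh planner statistics so the new index is costed sensibly.
ANALYZE contfloattable;
```

After that, re-run EXPLAIN on the query to see whether the planner picks the index up.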

"Sort  (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
"  Sort Key: cf.dateandtime, t.tagname"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Hash Join  (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
"        Hash Cond: (cf.tagindex = t.tagindex)"
"        ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
"              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"              ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
"                    Sort Key: i.tagindex, i.dateandtime"
"                    ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
"              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                          Sort Key: cf.tagindex, cf.dateandtime"
"                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

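The estimated cost seems to have gone up :( However, if I remove the order by clause (not exactly what I need, but it would work), this is what happens, a big reduction: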

"Hash Join  (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
"  Hash Cond: (cf.tagindex = t.tagindex)"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
"        Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"        ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
"              Sort Key: i.tagindex, i.dateandtime"
"              ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
"        ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"              ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                    Sort Key: cf.tagindex, cf.dateandtime"
"                    ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"  ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"        ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

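I have not tried this index yet... I will do so, though. Stand by.

Looking at it again, I think the inverse index could be even better, as it can be used not only in the Merge Join but also in the final Sort: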

create index cf_tag_datetime on contfloattable (dateandtime, tagindex)
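
One caveat if you try both: the statement above reuses the name cf_tag_datetime, so it will fail if the earlier (tagindex, dateandtime) index still exists. A sketch of the two ways around that, where cf_datetime_tag is a hypothetical name chosen only for this example:

```sql
-- Option A: keep both indexes, giving the reversed one its own
-- (hypothetical) name.
CREATE INDEX cf_datetime_tag ON contfloattable (dateandtime, tagindex);

-- Option B: replace the earlier index and reuse its name.
-- DROP INDEX cf_tag_datetime;
-- CREATE INDEX cf_tag_datetime ON contfloattable (dateandtime, tagindex);

ANALYZE contfloattable;
```

Keeping both costs extra disk space and slows inserts further, so Option B is probably the more reasonable choice once you know which column order the planner actually uses.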

score 1 · Accepted Answer

Here's another formulation. I'll be quite curious to see how it scales on the full data set. Create this index first:

CREATE INDEX contfloattable_tag_and_timeseg
ON contfloattable(tagindex, (floor(extract(epoch FROM dateandtime) / 60 / 15) ));

then run this with as much work_mem as you can afford:

SELECT 
  (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*,
  (SELECT t.tagname FROM conttagtable t WHERE t.tagindex = x.tagindex) AS tagname
FROM contfloattable x ORDER BY dateandtime, tagname;

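Sneaky Wombat: Explain from the above sql on the complete dataset (without the suggested index): http://explain.depesz.com/s/kGo

Alternatively, the following should need only one sequential pass through contfloattable, collecting values into a tuplestore, which is then JOINed against to get the tag name. It requires lots of work_mem: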

SELECT cf.dateandtime, cf."Val", cf.status, t.tagname
FROM 
  (
    SELECT (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*
    FROM contfloattable x
  ) cf
  INNER JOIN
  conttagtable t ON cf.tagindex = t.tagindex
ORDER BY cf.dateandtime, t.tagname;
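
One thing to watch for: a window function keeps every input row, so both queries above return one output row per row of contfloattable, with the chosen sample repeated across its bucket. If you want exactly one row per tag and bucket, PostgreSQL's DISTINCT ON is one way to get it. This is only a sketch, not tested against the full data set, and it picks the earliest sample per bucket because of the third ORDER BY key:

```sql
SELECT s.dateandtime, s."Val", s.status, s.tagname
FROM (
    -- One row per (tagindex, 15 minute bucket); DISTINCT ON keeps the
    -- first row of each group according to the inner ORDER BY.
    SELECT DISTINCT ON (cf.tagindex,
                        floor(extract(epoch FROM cf.dateandtime) / 60 / 15))
           cf.dateandtime, cf."Val", cf.status, t.tagname
    FROM contfloattable cf
    INNER JOIN conttagtable t ON t.tagindex = cf.tagindex
    ORDER BY cf.tagindex,
             floor(extract(epoch FROM cf.dateandtime) / 60 / 15),
             cf.dateandtime
) s
ORDER BY s.dateandtime, s.tagname;
```

The inner ORDER BY has to start with the DISTINCT ON expressions, which is why the final ordering is applied in an outer query.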

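Sneaky Wombat: Explain from the above sql on the complete dataset (without the suggested index): http://explain.depesz.com/s/57q

If it works, you will want to throw as much work_mem as you can at the query. You haven't mentioned your system's RAM, but you will want a decent chunk of it; try: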

SET work_mem = '500MB';

... or more if you have at least 4GB of RAM and are on a 64-bit CPU. Again, I'd be really interested to see how it works on the full data set.
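
If you'd rather not change work_mem for the whole session, a sketch like the following scopes it to a single transaction with SET LOCAL. The 500MB figure is just the value suggested above; adjust it to your machine.

```sql
BEGIN;

-- Applies only until COMMIT or ROLLBACK; other sessions are unaffected.
SET LOCAL work_mem = '500MB';

-- Confirm the setting took effect.
SHOW work_mem;

-- ... run the big query here ...

COMMIT;
```

Keep in mind that work_mem is a per-sort/per-hash limit, so a plan with several sort or hash nodes can use a multiple of that amount.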

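BTW, for the correctness of these queries I'd advise you to ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex); and then DROP INDEX t_tagindex;. It will take some time, since it builds a unique index. Most of the queries mentioned here assume that t.tagindex is unique in conttagtable, and that really should be enforced. A unique index can be used for additional optimisations that the old non-unique t_tagindex can't, and it produces much better statistics estimates.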
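
Since the data came from an Access dump and uniqueness was never checked, it may be worth verifying it before adding the key. A minimal sketch, assuming nothing beyond the schema shown in the question:

```sql
-- Any rows returned here are duplicated tagindex values that would make
-- ADD PRIMARY KEY fail.
SELECT tagindex, count(*) AS copies
FROM conttagtable
GROUP BY tagindex
HAVING count(*) > 1;

-- A primary key also implies NOT NULL, so check for NULLs too.
SELECT count(*) AS null_tagindexes
FROM conttagtable
WHERE tagindex IS NULL;

-- Only once both checks come back clean:
ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex);

-- Then drop the old non-unique index. The schema listing in the question
-- shows it named "tagindex" rather than t_tagindex, so use whatever name
-- \d conttagtable reports on your system.
-- DROP INDEX t_tagindex;
```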

Also, when comparing query plans, note that cost isn't necessarily strictly proportional to real-world execution time. If the estimates are good then it should roughly correlate, but the estimates are only that. Sometimes you'll see a high-cost plan execute faster than a supposedly low-cost plan due to things like bad rowcount estimates or index selectivity estimates, limitations in the query planner's ability to infer relationships, unexpected correlations, or cost parameters like random_page_cost and seq_page_cost that don't match the real system.
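
One way to check how far the estimates are off, without waiting for the full query, is to run EXPLAIN (ANALYZE, BUFFERS) on a small slice. This is a sketch: the one-day range below is an arbitrary choice based on the dates in the sample output, not something from the original posts.

```sql
-- Executes the query for real, so keep the slice small.
EXPLAIN (ANALYZE, BUFFERS)
SELECT cf.tagindex, min(cf.dateandtime) AS dateandtime
FROM contfloattable cf
WHERE cf.dateandtime >= timestamp '2012-11-16 00:00:00'
  AND cf.dateandtime <  timestamp '2012-11-17 00:00:00'
GROUP BY floor(extract(epoch FROM cf.dateandtime) / 60 / 15),
         cf.tagindex;
```

Comparing the estimated "rows=" figures with the "actual ... rows=" figures on each node shows where the planner's guesses diverge from reality.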
