
I want to use TimescaleDB to remove spikes directly from data stored in a PostgreSQL database.

My data is stored as values at 1-second intervals, and I want to compute 5-minute averages with the spikes removed.

I detect spikes using the standard deviation and exclude all data points that exceed a fixed z-score.
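(Concretely: if μ and σ are a bucket's mean and standard deviation, a sample x is kept when |x - μ| / σ < z_max, where z_max is the fixed z-score threshold. These names are mine, not the question's.)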

So in a first step I fetch all the data relevant to my analysis (data_filtered), then compute the average and standard deviation for each 5-minute block (avg_and_stddev_per_interval), then join the initial data (data_filtered) against the computed avg and stddev, exclude all values that fail my criterion, and finally compute the final 5-minute average without spikes.

with data_filtered as (
    select ts, value
    from schema.table 
    where some_criteria = 42 
    and ts >= '2018-11-12 10:00:00'
    and ts < '2018-11-13 10:00:00'
), 
avg_and_stddev_per_interval as (
    select time_bucket('5 minutes', ts) as five_min,
    avg(value) as avg_value,
    stddev(value) as stddev_value
    from data_filtered
    group by five_min   
)
select 
    time_bucket('5 minutes', ts) as tb,
    avg(value) as value
from data_filtered
left join avg_and_stddev_per_interval 
    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 
    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1 
group by tb;

This all works, but it is very slow. Requesting the complete data without any grouping (select * from data_filtered) and evaluating my criterion locally is much faster. However, I want to reduce the amount of data transferred, so that approach is not an option here.

Is there any way to speed up my query?


3 Answers


It looks like the worst of the performance hit happens in the JOIN (based on the query in your answer, not the one in your question). Ideally you wouldn't join on a subquery when it returns a lot of results, but I don't see how to avoid it given your criteria.

So here's my suggestion:

  1. Put the subquery results into a temp table
  2. Index the temp table
  3. Do the join against the temp table
  4. Encapsulate it all in a function

Now, I normally hate doing this because I dislike creating temp tables, but sometimes they really do give you the best performance you can't get any other way. (Not that it can't be done another way, but I can't think of a better-performing one.)

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()
    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
    -- materialize the per-bucket statistics once, instead of re-evaluating a subquery
    CREATE TEMP TABLE fm ON COMMIT DROP AS
        select time_bucket('5 minutes', ts) as five_min,
            avg(value) as avg_value,
            stddev(value) as stddev_value
        from schema.table
        where some_criteria = 42
        and ts >= '2018-11-12 00:00:00'
        and ts < '2018-11-13 00:00:00'
        group by five_min;

    CREATE INDEX ON fm (five_min);

    RETURN QUERY
    select time_bucket('5 minutes', ts), avg(value)
    from schema.table
    left join fm
        on ts >= fm.five_min 
        and ts < fm.five_min + interval '5 minutes'
    where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
    and abs((value-avg_value)/stddev_value) < 1     
    group by 1;
END;
$BODY$
    LANGUAGE plpgsql;
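
You'd then call it like this (assuming the function as defined above) to get the despiked 5-minute averages:

select * from schema.my_function();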

Obviously the index I created is just based on the example in the query you posted, though I see the actual query contains other things, so you'll want to index whatever fields are being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you can pass any parts of the query that should be variable as function parameters.
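
For instance, a parameterized variant might look like the following sketch; the parameter names, the timestamptz type for tb, and the cast of the average to numeric are my assumptions rather than anything from the original answer:

CREATE OR REPLACE FUNCTION schema.my_function(
    p_criteria int,        -- hypothetical parameter: value compared against some_criteria
    p_from timestamptz,    -- hypothetical parameter: inclusive start of the time range
    p_to timestamptz       -- hypothetical parameter: exclusive end of the time range
)
    RETURNS TABLE (tb timestamptz, avg numeric) AS
$BODY$
BEGIN
    -- materialize the per-bucket statistics once
    CREATE TEMP TABLE fm ON COMMIT DROP AS
        select time_bucket('5 minutes', ts) as five_min,
            avg(value) as avg_value,
            stddev(value) as stddev_value
        from schema.table
        where some_criteria = p_criteria
        and ts >= p_from
        and ts < p_to
        group by five_min;

    CREATE INDEX ON fm (five_min);

    RETURN QUERY
    select time_bucket('5 minutes', ts),
        avg(value)::numeric    -- cast so the result matches the declared return type
    from schema.table
    left join fm
        on ts >= fm.five_min
        and ts < fm.five_min + interval '5 minutes'
    where some_criteria = p_criteria
    and ts >= p_from
    and ts < p_to
    and abs((value - avg_value) / stddev_value) < 1
    group by 1;
END;
$BODY$
    LANGUAGE plpgsql;

select * from schema.my_function(42, '2018-11-12 00:00:00', '2018-11-13 00:00:00');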

Answered 2018-11-13T16:48:30.720

The easiest way is to replace the CTE parts with (temporary) views. That allows the optimizer to shuffle and reassemble the query parts.


CREATE TEMP VIEW data_filtered as
    SELECT ts, value
    FROM schema.table
    WHERE some_criteria = 42
    AND ts >= '2018-11-12 10:00:00'
    AND ts < '2018-11-13 10:00:00'
        ;

CREATE TEMP VIEW avg_and_stddev_per_interval as
    SELECT time_bucket('5 minutes', ts) as five_min
    , avg(value) as avg_value
    , stddev(value) as stddev_value
    FROM data_filtered
    GROUP BY 1
        ;

SELECT
    time_bucket('5 minutes', ts) as tb
    , avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval  av
    ON df.ts >= av.five_min
    AND df.ts < av.five_min + interval '5 minutes'
    WHERE abs((value-avg_value)/stddev_value) < 1
    GROUP BY 1
        ;
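
One note on re-use (my addition, not part of the original answer): temp views last until the session ends, and CREATE OR REPLACE lets you swap the filter window in place before re-running the final query, e.g. shifted by one day:

CREATE OR REPLACE TEMP VIEW data_filtered as
    SELECT ts, value
    FROM schema.table
    WHERE some_criteria = 42
    AND ts >= '2018-11-13 10:00:00'
    AND ts < '2018-11-14 10:00:00'
        ;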
Answered 2018-11-13T14:44:22.307

eurotrash's comment led to faster code, shown below:

select 
    time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table   
left join (
    select time_bucket('5 minutes', ts) as five_min,
        avg(value) as avg_value,
        stddev(value) as stddev_value
        from schema.table
        where some_criteria = 42
        and ts >= '2018-11-12 00:00:00'
        and ts < '2018-11-13 00:00:00'
        group by five_min
    ) as fm
    on ts >= fm.five_min 
    and ts < fm.five_min + interval '5 minutes'         
where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
    and abs((value-avg_value)/stddev_value) < 1     
group by tb;

Here I got rid of the CTEs, which were only there for readability anyway.

Still, this is about 8x slower than just requesting the average without spike removal.

EXPLAIN ANALYZE:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
  Sort Method: quicksort  Memory: 65kB
  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
              Rows Removed by Join Filter: 24770914
              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
                          Filter: (node_id = 8)
                          Rows Removed by Filter: 949135
              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
                                      Sort Method: quicksort  Memory: 251kB
                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
                                            Workers Planned: 2
                                            Workers Launched: 2
                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)
                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
                                                                    Filter: (node_id = 8)
                                                                    Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
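
The plan above shows the nested loop's join filter discarding about 24.7 million rows, which points at the range join itself as the bottleneck. A join-free variant (my own sketch, phrased in the simplified column names from the question rather than the actual angle_x/angle_y schema) computes the per-bucket statistics with window functions, so each row is compared against its own bucket's avg/stddev in a single pass:

select tb, avg(value) as value
from (
    select time_bucket('5 minutes', ts) as tb,
        value,
        avg(value)    over w as avg_value,
        stddev(value) over w as stddev_value
    from schema.table
    where some_criteria = 42
    and ts >= '2018-11-12 00:00:00'
    and ts < '2018-11-13 00:00:00'
    window w as (partition by time_bucket('5 minutes', ts))
) as s
-- stddev is NULL for single-row buckets and can be 0 for constant ones;
-- this guard skips those buckets instead of dividing by zero
where stddev_value > 0
and abs((value - avg_value) / stddev_value) < 1
group by tb;

Whether this beats the materialized join depends on the data, so it's worth comparing both under EXPLAIN ANALYZE.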
Answered 2018-11-13T13:29:30.543