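I highly doubt I'm doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for a thousand measurement systems.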
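You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity frequently: usually in short bursts, but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value has been repeating and build various filters on that information. Say you are measuring mpg on a car, but it is stuck at 20 mpg for an hour, then moves to 20.1, and so on. You'll want to evaluate the accuracy while it is stuck. You could also add alternative rules that detect when the car is on the highway, and with window functions you can generate the "state" of the car and have something to group on. Without further ado: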
--here's my data: different systems, the time of measurement, and the actual measurement
--the raw data also has a flag for whether or not a row is a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't nest window functions inside window functions, so I had to break it down into subqueries
--what we need is something to partition on, the 'state' of the system if you will, so I took a running total of the non-repeats
--this creates a column that stays the same while your data is repeating, i.e. something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) produces it
--I wanted a sequential count of repeats that restarts when the value stops repeating, and starts at 1 on the first repeat
--with that you can, for example, average the measurement under a condition based on how long it had been repeating
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement;
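To make the "average under some condition" idea concrete, here is a rough sketch of how the repeat counter could feed an aggregate. It assumes the temporary table cumulative_repeat_calculator_step_1 built above; the threshold of 1 and the output column names are purely illustrative, not something I have settled on.

```sql
-- Sketch only: the threshold (1) and the output column names are assumptions.
with ordered as (
    select *,
        case when repeat = 0 then 0
        else
            row_number() over (
                partition by system_measured, cumlative_sum_of_nonrepeats_by_system
                order by time_of_measurement
            ) - 1
        end as ordered_repeat
    from cumulative_repeat_calculator_step_1
)
select
    system_measured,
    -- plain average over every reading, stuck or not
    avg(measurement) as avg_all_readings,
    -- average that ignores readings repeated more than once in a row
    avg(measurement) filter (where ordered_repeat <= 1) as avg_without_long_repeats,
    -- how many readings the filter dropped
    count(*) filter (where ordered_repeat > 1) as readings_excluded
from ordered
group by system_measured
order by system_measured;
```

The FILTER clause needs PostgreSQL 9.4 or later; on older versions the same thing can be written with a CASE expression inside avg().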
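So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking about plpgsql because I suspect this needs to be done in-database or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to subqueries?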
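I have tested one alternative method, but it still relies on a subquery, and I think it is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table, and if the timestamp falls between the two, you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of "events". Do you think that's a better way to go?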
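For reference, the "starts and stops" method might look roughly like the sketch below. The table name starts_and_stops and the column run_id are made up for illustration, and I derive the ranges from cumulative_repeat_calculator_step_1 only to keep the example self-contained. It also assumes the join can include the system as an equality condition, rather than a bare 1=1, so that only the timestamp is matched by range.

```sql
-- Sketch only: starts_and_stops and run_id are hypothetical names.
-- One row per (system, run): the first and last timestamp of each run.
create temporary table starts_and_stops as
select
    system_measured,
    cumlative_sum_of_nonrepeats_by_system as run_id,
    min(time_of_measurement) as start_timestamp,
    max(time_of_measurement) as end_timestamp
from cumulative_repeat_calculator_step_1
group by system_measured, cumlative_sum_of_nonrepeats_by_system;

-- Range join back to the raw readings: each reading is assigned to the run
-- whose [start_timestamp, end_timestamp] interval contains it.
select
    d.system_measured,
    d.time_of_measurement,
    d.measurement,
    s.run_id,
    d.time_of_measurement - s.start_timestamp as time_in_state
from cumulative_repeat_calculator_data as d
join starts_and_stops as s
    on s.system_measured = d.system_measured  -- assumption: equality on system instead of 1=1
   and d.time_of_measurement between s.start_timestamp and s.end_timestamp
order by d.system_measured, d.time_of_measurement;
```

With the equality on system_measured, each reading is only compared against the runs of its own system, which may be cheaper than a pure range join across every device. I have not measured this at the 2 billion row scale.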