sql - Performing aggregation by date and time in SQL

Question

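I have a dataset containing several weeks of observations recorded at a 2-minute frequency. I want to increase the time interval from 2 minutes to 5 minutes. The problem is that the observation frequency is not always the same. In theory there should be 5 observations every 10 minutes, but that is often not the case. Please let me know how I can aggregate the observations with an average function, based on the time and date of each observation. In other words, I want to aggregate over every 5 minutes, even though the number of observations differs from one 5-minute interval to the next. My dates and times are stored in timestamp format.

Sample data: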

1 2007-09-14 22:56:12 5.39
2 2007-09-14 22:58:12 5.34
3 2007-09-14 23:00:12 5.16
4 2007-09-14 23:02:12 5.54
5 2007-09-14 23:04:12 5.30
6 2007-09-14 23:06:12 5.20

Expected results:

1 2007-09-14 23:00 5.29
2 2007-09-14 23:05 5.34

score 9 · Accepted Answer

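The answers to this question probably offer a good solution to your problem, showing ways to aggregate data into time windows efficiently.

Essentially, use the avg aggregate with: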

GROUP BY floor(extract(epoch from the_timestamp) / 60 / 5)
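
As a fuller sketch of that idea in PostgreSQL, the query below buckets the sample rows from the question and averages each bucket. The table name observation and the columns the_timestamp and value are assumptions (the question does not name them), and the rows are inlined with VALUES only so the query can run on its own.

```sql
-- Assumed names: observation(id, the_timestamp, value); the question does not give them.
WITH observation (id, the_timestamp, value) AS (
    VALUES
        (1, timestamp '2007-09-14 22:56:12', 5.39),
        (2, timestamp '2007-09-14 22:58:12', 5.34),
        (3, timestamp '2007-09-14 23:00:12', 5.16),
        (4, timestamp '2007-09-14 23:02:12', 5.54),
        (5, timestamp '2007-09-14 23:04:12', 5.30),
        (6, timestamp '2007-09-14 23:06:12', 5.20)
)
SELECT
    -- Convert the 5-minute bucket number back into the bucket's start time.
    timestamp 'epoch'
        + floor(extract(epoch from the_timestamp) / 60 / 5) * interval '5 minutes'
        AS bucket_start,
    avg(value) AS avg_value,
    count(*)   AS n_observations   -- rows per bucket; this varies with the data
FROM observation
GROUP BY bucket_start              -- same grouping as the floor(...) expression, via its alias
ORDER BY bucket_start;
```

With the sample rows this yields buckets starting at 22:55, 23:00 and 23:05, holding 2, 3 and 1 observations with averages of about 5.365, 5.333 and 5.20. Because floor labels each bucket by its start time, the labels and groupings differ from the expected results in the question; shift the timestamp before bucketing if you need a different alignment.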

score 3 · Accepted Answer

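Edit: I gave this some more thought and realized you can't simply go from 2 minutes to 5 minutes. It doesn't add up. I'll follow up on that, but the code below works once you have some 1-minute data to aggregate!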

--

If your data is in a "beginning" format, you can use the code inside this function, or create the function in your database for easy access:

CREATE OR REPLACE FUNCTION dev.beginning_datetime_floor(timestamp without time zone, integer)
/* switch out 'dev' with your schema name */
RETURNS timestamp without time zone AS
$BODY$ 
SELECT
date_trunc('minute',timestamp with time zone 'epoch' + 
 floor(extract(epoch from $1)/($2*60))*$2*60
 * interval '1 second') at time zone 'CST6CDT' /* change this to your time zone */
$BODY$
LANGUAGE sql VOLATILE;

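You just pass in the integer number of minutes you want to aggregate on (use 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30; these all divide 60 evenly, so the buckets line up with the top of the hour). Here are a couple of results: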

select dev.beginning_datetime_floor('2012-01-01 02:02:21',2)

= '2012-01-01 02:02:00'

select dev.beginning_datetime_floor('2012-01-01 02:02:21',5)

= '2012-01-01 02:00:00'

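Just test it, and use the built-in timestamp functions to add or subtract time so that it handles beginning versus ending timestamps.

Once you have the timestamp you want, do what Craig said: GROUP BY that timestamp, together with the aggregate function you want (probably an average).

You can test or tweak it with the following expression: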

date_trunc('minute',timestamp with time zone 'epoch' + 
 floor(extract(epoch from your_datetime)/(interval_minutes*60))*interval_minutes*60
 * interval '1 second') at time zone 'CST6CDT' /* change this to your time zone */
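
To tie the flooring step to the GROUP BY step, here is a minimal sketch. It uses a hypothetical helper, floor_to_minutes, that applies the same floor arithmetic without any time zone conversion. My understanding is that since PostgreSQL 9.2, extract(epoch from ...) on a timestamp without time zone no longer depends on the session time zone, so on those versions the "at time zone" step can shift results instead of correcting them; treat that as an assumption and test on your own version. The table and column names are assumptions too.

```sql
-- Hypothetical helper (not from the answer): floor a timestamp to an N-minute boundary.
CREATE OR REPLACE FUNCTION floor_to_minutes(timestamp without time zone, integer)
RETURNS timestamp without time zone AS
$BODY$
SELECT timestamp 'epoch'
       + floor(extract(epoch from $1) / ($2 * 60)) * $2 * 60 * interval '1 second'
$BODY$
LANGUAGE sql IMMUTABLE;

-- Assumed names: observation(the_timestamp, value), inlined so the query is self-contained.
WITH observation (the_timestamp, value) AS (
    VALUES
        (timestamp '2007-09-14 22:56:12', 5.39),
        (timestamp '2007-09-14 22:58:12', 5.34),
        (timestamp '2007-09-14 23:00:12', 5.16),
        (timestamp '2007-09-14 23:02:12', 5.54),
        (timestamp '2007-09-14 23:04:12', 5.30),
        (timestamp '2007-09-14 23:06:12', 5.20)
)
SELECT
    floor_to_minutes(the_timestamp, 5) AS five_minute_start,
    avg(value)                         AS avg_value
FROM observation
GROUP BY five_minute_start
ORDER BY five_minute_start;
```

Changing the second argument (for example to 10 or 15) changes the bucket width without touching the rest of the query.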

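It may turn out that you want to average the timestamps instead, for example if your interval durations are irregular. For that, you could create a similar function that rounds the timestamp rather than taking the floor.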
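
A minimal sketch of such a rounding function, assuming PostgreSQL. The name round_to_minutes is hypothetical; the body is the test expression above with floor swapped for round and without the time zone step.

```sql
-- Hypothetical helper: round a timestamp to the nearest N-minute boundary.
CREATE OR REPLACE FUNCTION round_to_minutes(timestamp without time zone, integer)
RETURNS timestamp without time zone AS
$BODY$
SELECT timestamp 'epoch'
       + round(extract(epoch from $1) / ($2 * 60)) * $2 * 60 * interval '1 second'
$BODY$
LANGUAGE sql IMMUTABLE;

-- 22:58:12 is closer to 23:00:00 than to 22:55:00, so it rounds up.
SELECT round_to_minutes(timestamp '2007-09-14 22:58:12', 5);

-- 23:02:12 is closer to 23:00:00 than to 23:05:00, so it rounds down.
SELECT round_to_minutes(timestamp '2007-09-14 23:02:12', 5);
```

How a value that falls exactly halfway between two boundaries is rounded may depend on the numeric type that extract returns in your version, so check that case if it matters for your data.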

score 1 · Accepted Answer

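By far the simplest option is to create a reference table. In that table you store the time intervals you are interested in:

(Adapt this to your own RDBMS's date notation.)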

CREATE TABLE interval (
  start_time    DATETIME,
  cease_time    DATETIME
);
INSERT INTO interval SELECT '2012-10-22 12:00', '2012-10-22 12:05';
INSERT INTO interval SELECT '2012-10-22 12:05', '2012-10-22 12:10';
INSERT INTO interval SELECT '2012-10-22 12:10', '2012-10-22 12:15';
INSERT INTO interval SELECT '2012-10-22 12:15', '2012-10-22 12:20';
INSERT INTO interval SELECT '2012-10-22 12:20', '2012-10-22 12:25';
INSERT INTO interval SELECT '2012-10-22 12:25', '2012-10-22 12:30';
INSERT INTO interval SELECT '2012-10-22 12:30', '2012-10-22 12:35';
INSERT INTO interval SELECT '2012-10-22 12:35', '2012-10-22 12:40';

Then you simply join and aggregate...

SELECT
  interval.start_time,
  AVG(observation.value)
FROM
  interval
LEFT JOIN
  observation
    ON  observation.timestamp >= interval.start_time
    AND observation.timestamp <  interval.cease_time
GROUP BY
  interval.start_time

Note: you only need to create and populate the interval table once; after that you can reuse it as many times as you like.
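
If you happen to be on PostgreSQL (an assumption; this answer is written to be RDBMS-neutral), the intervals can be generated rather than inserted by hand. The sketch below uses generate_series and the sample rows from the question; the names buckets and obs_time are hypothetical stand-ins for the interval table and the observation.timestamp column.

```sql
WITH observation (obs_time, value) AS (
    VALUES
        (timestamp '2007-09-14 22:56:12', 5.39),
        (timestamp '2007-09-14 22:58:12', 5.34),
        (timestamp '2007-09-14 23:00:12', 5.16),
        (timestamp '2007-09-14 23:02:12', 5.54),
        (timestamp '2007-09-14 23:04:12', 5.30),
        (timestamp '2007-09-14 23:06:12', 5.20)
),
buckets AS (
    -- One row per 5-minute interval: [start_time, cease_time)
    SELECT gs AS start_time,
           gs + interval '5 minutes' AS cease_time
    FROM generate_series(timestamp '2007-09-14 22:55:00',
                         timestamp '2007-09-14 23:10:00',
                         interval '5 minutes') AS gs
)
SELECT
    b.start_time,
    avg(o.value)   AS avg_value,        -- NULL when the interval has no observations
    count(o.value) AS n_observations    -- 0 when the interval has no observations
FROM buckets AS b
LEFT JOIN observation AS o
       ON  o.obs_time >= b.start_time
       AND o.obs_time <  b.cease_time
GROUP BY b.start_time
ORDER BY b.start_time;
```

The 23:10 interval is included on purpose: it has no observations, so the LEFT JOIN still returns it with a NULL average, which is the main thing this approach gives you over a plain GROUP BY on a computed bucket.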

score 1 · Accepted Answer

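OK, so this is just one way to handle this. I hope it gets you thinking about how to transform the data to suit your analysis needs.

There is one prerequisite for testing this code: you need a table containing every possible 1-minute timestamp. There are many ways to get one; I just used what I had available, which is one table, dim_time, with every minute from (00:01:00) to (23:59:00), and another table with every possible date (dim_date). When you join these (1=1) you get every possible minute for every possible date.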
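
If you do not have dim_date and dim_time tables, one possible stand-in on PostgreSQL is generate_series. This is a sketch under that assumption, not part of the original setup; it creates the same possible_timestamps temp table that the dim_date/dim_time query further down creates, so run one or the other, not both.

```sql
-- Assumption: PostgreSQL. Builds one row per minute of 2007-09-14.
CREATE TEMPORARY TABLE possible_timestamps AS
SELECT gs AS minute_timestamp
FROM generate_series(timestamp '2007-09-14 00:00:00',
                     timestamp '2007-09-14 23:59:00',
                     interval '1 minute') AS gs;

-- Sanity check: a full day should contain 1440 minutes.
SELECT count(*) AS n_minutes FROM possible_timestamps;
```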

--first you need to create some functions I'll use later
--credit to this first function goes to David Walling
CREATE OR REPLACE FUNCTION dev.beginning_datetime_floor(timestamp without time zone, integer)
  RETURNS timestamp without time zone AS
$BODY$ 
SELECT
date_trunc('minute',timestamp with time zone 'epoch' + 
    floor(extract(epoch from $1)/($2*60))*$2*60
* interval '1 second') at time zone 'CST6CDT'
$BODY$
  LANGUAGE sql VOLATILE;

--the following function is what I described on my previous post  
CREATE OR REPLACE FUNCTION dev.round_minutes(timestamp without time zone, integer)
  RETURNS timestamp without time zone AS
$BODY$ 
  SELECT date_trunc('hour', $1) + cast(($2::varchar||' min') as interval) * round(date_part('minute',$1)::float / cast($2 as float)) 
$BODY$
  LANGUAGE sql VOLATILE;

--let's load the data into a temp table, I added some data points. note: i got rid of the partial seconds
SELECT cast(timestamp_original as timestamp) as timestamp_original, datapoint INTO TEMPORARY TABLE timestamps_second2
FROM
(
SELECT '2007-09-14 22:56:12' as timestamp_original, 0 as datapoint
UNION
SELECT '2007-09-14 22:58:12' as timestamp_original, 1 as datapoint
UNION
SELECT '2007-09-14 23:00:12' as timestamp_original, 10 as datapoint 
UNION
SELECT '2007-09-14 23:02:12' as timestamp_original, 100 as datapoint
UNION
SELECT '2007-09-14 23:04:12' as timestamp_original, 1000 as datapoint
UNION
SELECT '2007-09-14 23:06:12' as timestamp_original, 10000 as datapoint
) as data;

--this is the bit of code you'll have to replace with your implementation of getting all possible minutes
--you could make some sequence of timestamps in R, or simply make the timestamps in Excel to test out the rest of the code
--the result of the query is simply '2007-09-14 00:00:00' through '2007-09-14 23:59:00'
SELECT * INTO TEMPORARY TABLE possible_timestamps
FROM
(
select the_date + beginning_minute as minute_timestamp
FROM datawarehouse.dim_date as dim_date
JOIN datawarehouse.dim_time as dim_time
ON 1=1
where dim_date.the_date = '2007-09-14'
group by the_date, beginning_minute
order by the_date, beginning_minute
) as data;

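--round to nearest minute (be sure to think about how this might change your results)
--note: date_part('minute', ...) ignores the seconds, so with an interval of 1 this truncates to the minute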
SELECT * INTO TEMPORARY TABLE rounded_timestamps2
FROM
(
SELECT dev.round_minutes(timestamp_original,1) as minute_timestamp_rounded, datapoint
from timestamps_second2
) as data;

--let's join what minutes we have data for versus the possible minutes
--I used some subqueries so when you select all from the table you'll see the important part (not needed)
SELECT * INTO TEMPORARY TABLE joined_with_possibles
FROM
(
SELECT *
FROM
(
SELECT *, (MIN(minute_timestamp_rounded) OVER ()) as min_time, (MAX(minute_timestamp_rounded) OVER ()) as max_time
FROM possible_timestamps as t1
LEFT JOIN rounded_timestamps2 as t2
ON t1.minute_timestamp = t2.minute_timestamp_rounded
ORDER BY t1.minute_timestamp asc
) as inner_query
WHERE minute_timestamp >= min_time
AND minute_timestamp <= max_time
) as data;

--here's the tricky part that might not suit your needs, but it's one method
--if it's missing a value it grabs the previous value
--if it's missing the prior value it grabs the one before that, otherwise it's null
--best practice would be run another case statement with 0,1,2 specifying which point was pulled, then you can count those when you aggregate
SELECT * INTO TEMPORARY TABLE shifted_values
FROM
(
SELECT 
*,
case 
when datapoint is not null then datapoint
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is not null
  then lag(datapoint,1) over (order by minute_timestamp asc)
when datapoint is null and (lag(datapoint,1) over (order by minute_timestamp asc)) is null and (lag(datapoint,2) over (order by minute_timestamp asc)) is not null
  then lag(datapoint,2) over (order by minute_timestamp asc)
else null end as last_good_value
from joined_with_possibles
ORDER BY minute_timestamp asc
) as data;

--now we use the function from my previous post to make the timestamps to aggregate on
SELECT * INTO TEMPORARY TABLE shifted_values_with_five_minute
FROM
(
SELECT *, dev.beginning_datetime_floor(minute_timestamp,5) as five_minute_timestamp
FROM shifted_values
) as data;

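--finally we aggregate
--note: this averages the raw datapoint column; use last_good_value instead to average the gap-filled series
SELECT
AVG(datapoint) as avg_datapoint, five_minute_timestamp
FROM shifted_values_with_five_minute
GROUP BY five_minute_timestamp;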
