r - How to subsample a data frame based on a datetime column in R

Question

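I would like to subsample a data frame at one-hour intervals based on a datetime column, starting from the time value in the first row. My data frame has one row every 10 minutes from the first row to the last. Example data is below: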

structure(list(datetime = structure(1:19, .Label = c("30/03/2011 05:09", 
"30/03/2011 05:19", "30/03/2011 05:29", "30/03/2011 05:39", "30/03/2011 05:49", 
"30/03/2011 05:59", "30/03/2011 06:09", "30/03/2011 06:19", "30/03/2011 06:29", 
"30/03/2011 06:39", "30/03/2011 06:49", "30/03/2011 06:59", "30/03/2011 07:09", 
"30/03/2011 07:19", "30/03/2011 07:29", "30/03/2011 07:39", "30/03/2011 07:49", 
"30/03/2011 07:59", "30/03/2011 08:09"), class = "factor"), a_count = c(66L, 
34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L, 17L, 38L, 24L, 19L, 
60L, 54L, 27L, 36L, 45L), b_count = c(166.49, 167.54, 168.31, 
168.81, 169.24, 169.61, 169.96, 170.29, 170.63, 170.98, 171.31, 
171.62, 171.94, 172.29, 172.68, 173.15, 173.71, 174.34, 174.99
)), .Names = c("datetime", "a_count", "b_count"), class = "data.frame", row.names = c(NA, 
-19L))

df

           datetime a_count b_count
1  30/03/2011 05:09      66  166.49
2  30/03/2011 05:19      34  167.54
3  30/03/2011 05:29      33  168.31
4  30/03/2011 05:39      20  168.81
5  30/03/2011 05:49      12  169.24
6  30/03/2011 05:59      44  169.61
7  30/03/2011 06:09      36  169.96
8  30/03/2011 06:19      29  170.29
9  30/03/2011 06:29      21  170.63
10 30/03/2011 06:39      22  170.98
11 30/03/2011 06:49      17  171.31
12 30/03/2011 06:59      38  171.62
13 30/03/2011 07:09      24  171.94
14 30/03/2011 07:19      19  172.29
15 30/03/2011 07:29      60  172.68
16 30/03/2011 07:39      54  173.15
17 30/03/2011 07:49      27  173.71
18 30/03/2011 07:59      36  174.34
19 30/03/2011 08:09      45  174.99

I would like to end up with the following data frame:

        datetime a_count b_count
30/03/2011 05:09      66  166.49
30/03/2011 06:09      36  169.96
30/03/2011 07:09      24  171.94
30/03/2011 08:09      45  174.99

Any suggestions would be much appreciated!

score 5 · Accepted Answer

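It's hard to guess what structure you have. Is it guaranteed that you have a value at the first time value + x times 60 minutes? What should happen if no value is found? What should happen if you have two values at that time? Do you need approximate matching, e.g. should 09:10 count as 09:09?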

An idea to get you started follows:

# I will call your dataframe `d`. 
# Transform datetime to a POSIXct object, R's datatype for timestamps
d$datetime <- as.POSIXct(as.character(d$datetime), format='%d/%m/%Y %H:%M')
# Extract the minutes
d$minute <- as.numeric(format(d$datetime, '%M'))
# And select by identical minute.
subset(d, minute == d$minute[1])
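
If the timestamps are not guaranteed to land exactly on the hour marks, the approximate matching raised above could be sketched roughly as follows. This is only an illustration, not part of the original answer: it assumes `datetime` has already been parsed to POSIXct, and `tol_mins` is a hypothetical tolerance you would choose yourself.

```r
# Example data rebuilt from the question (one row every 10 minutes).
d <- data.frame(
  datetime = seq(as.POSIXct("2011-03-30 05:09", tz = "UTC"),
                 by = "10 min", length.out = 19),
  a_count = c(66, 34, 33, 20, 12, 44, 36, 29, 21, 22,
              17, 38, 24, 19, 60, 54, 27, 36, 45)
)

# Hourly target times, starting at the first row.
targets <- seq(d$datetime[1], max(d$datetime), by = "1 hour")

# For each target, find the index of the closest row.
nearest <- vapply(seq_along(targets), function(i) {
  which.min(abs(as.numeric(difftime(d$datetime, targets[i], units = "mins"))))
}, integer(1))

# Keep a match only if it lies within the (hypothetical) tolerance.
tol_mins <- 2
gap <- abs(as.numeric(difftime(d$datetime[nearest], targets, units = "mins")))
hourly <- d[nearest[gap <= tol_mins], ]
```

Targets with no row inside the tolerance are simply dropped here; whether that is the right behaviour depends on the answers to the questions above.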

score 3 · Accepted Answer

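> df$datetime <- as.POSIXct(as.character(df$datetime), format = "%d/%m/%Y %H:%M")
> # Minutes elapsed since the first row; units are set explicitly because diff() otherwise chooses them automatically
> df$dif <- c(0, cumsum(as.numeric(diff(df$datetime), units = "mins")))
> df[df$dif %% 60 == 0,]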

               datetime a_count b_count dif
  2011-03-30 05:09:00      66  166.49   0
  2011-03-30 06:09:00      36  169.96  60
  2011-03-30 07:09:00      24  171.94 120
  2011-03-30 08:09:00      45  174.99 180

I have the same questions as Thilo, but here is another solution.
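
A variation on the same idea, as a rough sketch: measure each row's offset from the first row directly with `difftime()`, so no cumulative sum is needed. The rows below are just a handful taken from the question's data, and the `tz = "UTC"` setting is my own assumption to keep daylight-saving changes out of the picture.

```r
# A few rows from the question's data.
df <- data.frame(
  datetime = c("30/03/2011 05:09", "30/03/2011 05:19", "30/03/2011 06:09",
               "30/03/2011 06:19", "30/03/2011 07:09"),
  a_count = c(66L, 34L, 36L, 29L, 24L)
)
df$datetime <- as.POSIXct(df$datetime, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Minutes elapsed since the first row, with the units fixed explicitly.
elapsed <- as.numeric(difftime(df$datetime, df$datetime[1], units = "mins"))

# Keep the rows that sit a whole number of hours after the first one.
hourly <- df[elapsed %% 60 == 0, ]
```

Like the original, this only keeps exact multiples of 60 minutes, so a row at 06:10 would be skipped.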

score 1 · Accepted Answer

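You can also use the lubridate package to convert the time format, which may be more intuitive and easier to remember.

In addition, you can add a variable based on the hour and then summarise what you want with plyr.

In the example below, I took the sum and the mean of a_count. You may need something different depending on your purpose.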

library(plyr)
library(lubridate)

df2 <- mutate(df, dt = dmy_hm(as.character(datetime)), hour = hour(dt), minute = minute(dt))
summary <- ddply(df2, .(hour), summarize, a_mean = mean(a_count), a_sum = sum(a_count))
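
For readers without plyr or lubridate installed, a base R sketch of the same per-hour summary might look like the following. It is an approximation of the approach above, not the answer's own code, and it assumes `datetime` is already a POSIXct column; the name `hourly_summary` is made up for this example.

```r
# Example data rebuilt from the question (one row every 10 minutes).
df <- data.frame(
  datetime = seq(as.POSIXct("2011-03-30 05:09", tz = "UTC"),
                 by = "10 min", length.out = 19),
  a_count = c(66, 34, 33, 20, 12, 44, 36, 29, 21, 22,
              17, 38, 24, 19, 60, 54, 27, 36, 45)
)

# Clock hour of each row, used as the grouping variable.
df$hour <- as.integer(format(df$datetime, "%H"))

# Sum and mean of a_count within each clock hour.
a_sum <- tapply(df$a_count, df$hour, sum)
a_mean <- tapply(df$a_count, df$hour, mean)

hourly_summary <- data.frame(
  hour = as.integer(names(a_sum)),
  a_mean = as.vector(a_mean),
  a_sum = as.vector(a_sum)
)
```

Note that this groups by clock hour and aggregates, which is not the same as picking one row per hour as the question asks.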
