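Here is a solution for `double` time units, and a simpler solution for `integer` time units. I tested the `double` solution on 10,000 records and it ran instantly on my 2015 laptop. I can't make any guarantees about performance on 40 GB of data.
If you want to generalize this code, I would look at the RcppRoll package and learn how to implement C++ code in R.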
Solution for `double` time units
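I broke this into two problems. First, work out the window size for each row by looking back until the window covers at least 5 minutes (or until the data runs out). Second, sum the distance and the time from the current observation back to that look-back row.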
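Bad loop code in R usually tries to "grow" a vector. Preallocating the vector at its final length and then changing its elements is far more efficient.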
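To make the preallocation point concrete, here is a minimal sketch comparing the two styles. The function names `grow` and `prealloc` are made up for this illustration, and the timings will depend on your machine, so none are quoted here.

```r
n <- 1e4

# Growing: every c() call copies the whole vector built so far.
grow <- function(n) {
  out <- integer(0)
  for (i in seq_len(n)) out <- c(out, i * 2L)
  out
}

# Preallocating: allocate once at full length, then overwrite elements.
prealloc <- function(n) {
  out <- rep(NA_integer_, n)
  for (i in seq_len(n)) out[i] <- i * 2L
  out
}

# Both produce the same vector; only the cost differs.
identical(grow(n), prealloc(n))
system.time(grow(n))
system.time(prealloc(n))
```

The gap between the two timings generally widens as `n` grows, because the growing version copies an ever longer vector on each iteration.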
input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
var_window_cumsum <- function(input, MIN_TIME) {
  if (is.null(input$time) || is.null(input$dist)) {
    stop("input must have variables time and dist that record the row's duration and distance traveled.")
  }
  n <- nrow(input)

  # First, figure out how far we need to look back. This vector stores the
  # position of the first record that gets the target record's window up to
  # MIN_TIME or more. If we can't look back that far, we leave it as NA.
  time_indx <- rep(NA_integer_, n) # always preallocate your vector!
  for (i in seq_len(n)) {
    prior <- i # start at self in case the observation is already >= MIN_TIME
    while (sum(input$time[i:prior]) < MIN_TIME && prior > 1) {
      prior <- prior - 1
    }
    # if we can't look back to our minimum time, leave the index as NA
    if (sum(input$time[i:prior]) >= MIN_TIME) {
      time_indx[i] <- prior
    }
  }

  # Now that we know how far to look back, it's easy to find the total
  # distance and total time within each window.
  dist5 <- rep(NA_real_, n)
  time5 <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (!is.na(time_indx[i])) {
      dist5[i] <- sum(input$dist[i:time_indx[i]])
      time5[i] <- sum(input$time[i:time_indx[i]])
    }
  }

  cbind(input,
        window_dist = dist5,
        window_time = time5,
        window_start = time_indx)
}
# Output looks good.
# Warning: the example data does not cover all cases,
# and I have not set up thorough testing.
var_window_cumsum(input, 5)

# Test on a larger dataset, 10k records
set.seed(1234)
n <- 10000
med_input <- data.frame(
  dist = sample(1:5, n, replace = TRUE),
  time = sample(1:60, n, replace = TRUE) / 10
)
# you should inspect this to make sure there are no errors
med_output <- var_window_cumsum(med_input, 5)
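If the nested loop becomes a bottleneck on larger data, one possible alternative is to do the look-back search on cumulative sums. The sketch below is not the code above: `var_window_cumsum_vec` is a hypothetical name, and it assumes a positive `min_time` and non-negative durations so the cumulative time is sorted. It uses base R's `findInterval` to locate each window start. With non-integer times, differences of cumulative sums can differ from direct sums by floating-point rounding, so results right at the threshold may not match the loop version exactly.

```r
# Hypothetical vectorized variant of the look-back search (an illustration only).
var_window_cumsum_vec <- function(input, min_time) {
  stopifnot(min_time > 0, all(input$time >= 0))
  # Element k holds the total time (or distance) of all rows before row k.
  cum_time <- c(0, cumsum(input$time))
  cum_dist <- c(0, cumsum(input$dist))
  # For row i, the window starts at the largest p such that
  # cum_time[i + 1] - cum_time[p] >= min_time.
  start <- findInterval(cum_time[-1] - min_time, cum_time)
  start[start == 0] <- NA_integer_ # not enough history to reach min_time
  data.frame(
    input,
    window_dist = cum_dist[-1] - cum_dist[start],
    window_time = cum_time[-1] - cum_time[start],
    window_start = start
  )
}

input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
var_window_cumsum_vec(input, 5)
```

On the small example this gives window starts of NA, NA, NA, 1, 4, 5, 5. It would be worth comparing it against the loop version on your own data before relying on it.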
Solution for `integer` time units
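If your time units are integers and your data isn't too large, it may work to `complete` your dataset. This is a bit of a hack, but here I create a `timeid` variable that runs continuously from the start time to the maximum time, so that there is one row for every integer time unit. From there it is easy to compute a rolling sum over the last five time units. Finally, we get rid of all the fake rows we added (make sure you do this, because their rolling sums are not valid data). Also note that I use `roll_sumr` rather than `roll_sum`; `roll_sumr` pads the left side of the output vector with 4 NAs for the first 4 units.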
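As a small illustration of the padding difference, assuming RcppRoll's default arguments (no fill for `roll_sum`, NA fill with right alignment for `roll_sumr`), compare the two on a toy vector with a window of 3:

```r
library(RcppRoll)

x <- c(1, 2, 3, 4, 5, 6)

# roll_sum drops incomplete windows, so the result is shorter than x.
unpadded <- roll_sum(x, 3)

# roll_sumr is right-aligned and pads the left with NA, so the result
# lines up row for row with x and can go straight into mutate().
padded <- roll_sumr(x, 3)

length(unpadded) # 4
length(padded)   # 6
padded           # NA NA 6 9 12 15
```

The same-length output is what allows the rolling sum to be added as a column with `mutate`.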
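library(tidyverse)
library(RcppRoll)

input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
desired_dist5 <- c(NA, NA, NA, 10, 5, 9, 10)
desired_time5 <- c(NA, NA, NA, 6, 5, 6, 7)

output <- input %>%
  # timeid is the clock time at the end of each row; realrow marks original rows
  mutate(timeid = cumsum(time),
         realrow = TRUE) %>%
  # add one (fake) row for every integer time unit that has no observation
  complete(timeid = 1:max(timeid)) %>%
  # rolling sums over the last 5 time units, ignoring the NA-filled fake rows
  mutate(dist5 = roll_sumr(dist, 5, na.rm = TRUE),
         time5 = roll_sumr(time, 5, na.rm = TRUE)) %>%
  # drop the fake rows (realrow is NA there) and the helper columns
  filter(realrow) %>%
  select(-c(realrow, timeid))

# Check against the example table
# (the first three comparisons return NA because both sides are NA)
output$dist5 == desired_dist5
output$time5 == desired_time5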