我们编写了一个包来分析与时间窗口相关的大量事件。
为了进行分析,我们需要建立窗口的一些属性以及它们之间的交叉引用。
这是使用 data.table 的本机语法完成的。一些步骤的示例包含在下面的reprex 中。
我们现在正在寻求使用 dplyr/dtplyr 重新构建这个包以提高可读性并与其他方共享。
虽然我可以用 dplyr 语法编写“查询”,但我没有看到一种将更新应用到基础表的简洁方式——添加列、更新行等,而无需重复创建和替换副本。当数据量很大时,data.table 的“原地更新”特性是非常可取的。有没有办法在 dplyr 语法中利用这一点?(我遇到了 immutable = FALSE 的障碍并尝试使用 rows_update())
library(data.table)
set.seed <- 123
#Create a table of events with timestamp and an event type (501 events randomly generated over the previous 30 days)
DT1 <- data.table(timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501)*60*1440*30),
eventType=c('A', 'B', 'C'))
setkey(DT1, timeStamp)
print(DT1)
#> timeStamp eventType
#> 1: 2021-02-23 00:42:37 A
#> 2: 2021-02-23 04:21:43 A
#> 3: 2021-02-23 05:23:51 C
#> 4: 2021-02-23 06:45:36 C
#> 5: 2021-02-23 08:34:32 B
#> ---
#> 497: 2021-03-24 11:32:09 A
#> 498: 2021-03-24 13:49:53 B
#> 499: 2021-03-24 14:26:55 C
#> 500: 2021-03-24 18:11:33 C
#> 501: 2021-03-24 20:13:51 A
#Create a table of time windows. One for each date represented with an early and late time for each
#Assign this a class (in this example the value of the most common eventType)
DT2 <- DT1[,keyby=.(date=lubridate::date(timeStamp)),
.(earlyTime = min(timeStamp - 1),
lateTime = max(timeStamp + 1),
as = sum(eventType == 'A'),
bs = sum(eventType == 'B'),
cs = sum(eventType == 'C'))][
,.(date,
earlyTime,
lateTime,
class=ifelse(as >= bs & as >= cs, 'A', ifelse(bs >= cs, 'B', 'C')))]
print(head(DT2))
#> date earlyTime lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime), rn := .I]
#For each row, get the row number of the previous instance of this class
DT2[order(class, rn), prevOfClass := shift(rn, 1), by=.(class)]
print(head(DT2))
#> date earlyTime lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13 B 1 NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14 B 2 1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44 A 3 NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51 B 4 2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46 A 5 3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44 A 6 5
#For each row that is not a 'C' find the previous and next instances of a C type row
#Note that when we assigned rn we ensured that the rows were in ascending time order
#so rn can be used as a proxy for sorting by time
DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))]
#> rn rn nextC prevC
#> 1: 1 1 8 NA
#> 2: 2 2 8 NA
#> 3: 3 3 8 NA
#> 4: 4 4 8 NA
#> 5: 5 5 8 NA
#> 6: 6 6 8 NA
#> 7: 7 7 8 NA
#> 8: 9 9 13 8
#> 9: 10 10 13 8
#> 10: 11 11 13 8
#> 11: 12 12 13 8
#> 12: 14 14 16 13
#> 13: 15 15 16 13
#> 14: 17 17 26 16
#> 15: 18 18 26 16
#> 16: 19 19 26 16
#> 17: 20 20 26 16
#> 18: 21 21 26 16
#> 19: 22 22 26 16
#> 20: 23 23 26 16
#> 21: 24 24 26 16
#> 22: 25 25 26 16
#> 23: 28 28 30 27
#> 24: 29 29 30 27
#> rn rn nextC prevC
#But I want to add this information as additional columns to the base table
DT2[DT2[class=='C'][DT2[class != 'C'],
on=.(rn > rn),
by=.EACHI,
.(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))],
on = .(rn),
':='(nextC=i.nextC, prevC = i.prevC)
]
print(DT2[,.(rn, date, class, prevOfClass, nextC, prevC)])
#> rn date class prevOfClass nextC prevC
#> 1: 1 2021-02-23 B NA 8 NA
#> 2: 2 2021-02-24 B 1 8 NA
#> 3: 3 2021-02-25 A NA 8 NA
#> 4: 4 2021-02-26 B 2 8 NA
#> 5: 5 2021-02-27 A 3 8 NA
#> 6: 6 2021-02-28 A 5 8 NA
#> 7: 7 2021-03-01 A 6 8 NA
#> 8: 8 2021-03-02 C NA NA NA
#> 9: 9 2021-03-03 A 7 13 8
#> 10: 10 2021-03-04 A 9 13 8
#> 11: 11 2021-03-05 B 4 13 8
#> 12: 12 2021-03-06 A 10 13 8
#> 13: 13 2021-03-07 C 8 NA NA
#> 14: 14 2021-03-08 A 12 16 13
#> 15: 15 2021-03-09 B 11 16 13
#> 16: 16 2021-03-10 C 13 NA NA
#> 17: 17 2021-03-11 A 14 26 16
#> 18: 18 2021-03-12 B 15 26 16
#> 19: 19 2021-03-13 A 17 26 16
#> 20: 20 2021-03-14 B 18 26 16
#> 21: 21 2021-03-15 A 19 26 16
#> 22: 22 2021-03-16 A 21 26 16
#> 23: 23 2021-03-17 A 22 26 16
#> 24: 24 2021-03-18 A 23 26 16
#> 25: 25 2021-03-19 B 20 26 16
#> 26: 26 2021-03-20 C 16 NA NA
#> 27: 27 2021-03-21 C 26 NA NA
#> 28: 28 2021-03-22 B 25 30 27
#> 29: 29 2021-03-23 A 24 30 27
#> 30: 30 2021-03-24 C 27 NA NA
#> rn date class prevOfClass nextC prevC
#What would be the best approach to this using dplyr / dtplyr syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and...
#There are many more update and enrichments that need to be applied
#some of which add new columns, others will update just a few rows
#in a column
#So 'mutate in place/by reference' is highly desirable
由reprex 包于 2021-03-25 创建(v1.0.0)