r - What is the best way to do complex updates to a data.table using dplyr/dtplyr?

Question

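We have written a package to analyse large volumes of events in relation to time windows.

To perform the analysis we need to establish some attributes of the windows and the cross-references between them.

This is done using data.table's native syntax. Examples of some of the steps are included in the reprex below.

We are now looking to rebuild the package using dplyr/dtplyr to improve readability and to share it with other parties.

While I can write the "queries" in dplyr syntax, I don't see a neat way of applying updates to the underlying table (adding columns, updating rows and so on) without repeatedly creating and replacing copies. With large data volumes, data.table's update-in-place behaviour is highly desirable. Is there a way to take advantage of it from dplyr syntax? (I ran into obstacles with immutable = FALSE and have tried rows_update().)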
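
For context, this is a minimal sketch of the kind of by-reference update I was hoping for with immutable = FALSE. The table `toy` and its values are hypothetical stand-ins for DT2, and the sketch assumes that dtplyr emits `:=` against the input object when lazy_dt() is called with immutable = FALSE, as its documentation describes. It only covers adding a grouped column, not the join-based updates further down.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy table standing in for DT2
toy <- data.table(rn = 1:5, class = c("B", "B", "A", "C", "A"))

# immutable = FALSE lets dtplyr generate code that modifies `toy` itself
# rather than working on a copy
lazy_toy <- lazy_dt(toy, immutable = FALSE)

# Same idea as the prevOfClass step in the reprex: previous rn within each class
step <- lazy_toy %>%
  group_by(class) %>%
  mutate(prevOfClass = lag(rn))

# Inspect the data.table call that dtplyr generated
show_query(step)

# Nothing runs until the pipeline is evaluated; evaluating it applies the update
invisible(as.data.table(step))

# The original table should now carry the new column
print(names(toy))
```

Because the pipeline is lazy, the update only reaches `toy` once the result is evaluated, which is easy to overlook when the return value is not needed.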

library(data.table)
set.seed(123)
#Create a table of events with timestamp and an event type (501 events randomly generated over the previous 30 days)
DT1 <- data.table(timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501)*60*1440*30), 
                  eventType=c('A', 'B', 'C'))
setkey(DT1, timeStamp)
print(DT1)
#>                timeStamp eventType
#>   1: 2021-02-23 00:42:37         A
#>   2: 2021-02-23 04:21:43         A
#>   3: 2021-02-23 05:23:51         C
#>   4: 2021-02-23 06:45:36         C
#>   5: 2021-02-23 08:34:32         B
#>  ---                              
#> 497: 2021-03-24 11:32:09         A
#> 498: 2021-03-24 13:49:53         B
#> 499: 2021-03-24 14:26:55         C
#> 500: 2021-03-24 18:11:33         C
#> 501: 2021-03-24 20:13:51         A
#Create a table of time windows.  One for each date represented with an early and late time for each
#Assign this a class (in this example the value of the most common eventType)
DT2 <- DT1[, .(earlyTime = min(timeStamp - 1),
               lateTime = max(timeStamp + 1),
               as = sum(eventType == 'A'),
               bs = sum(eventType == 'B'),
               cs = sum(eventType == 'C')),
           keyby = .(date = lubridate::date(timeStamp))][
               , .(date,
                   earlyTime,
                   lateTime,
                   class = ifelse(as >= bs & as >= cs, 'A', ifelse(bs >= cs, 'B', 'C')))]
print(head(DT2))
#>          date           earlyTime            lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13     B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14     B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44     A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51     B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46     A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44     A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime), rn := .I]

#For each row, get the row number of the previous instance of this class
DT2[order(class, rn), prevOfClass := shift(rn, 1), by=.(class)]
print(head(DT2))
#>          date           earlyTime            lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13     B  1          NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14     B  2           1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44     A  3          NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51     B  4           2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46     A  5           3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44     A  6           5
#For each row that is not a 'C' find the previous and next instances of a C type row
#Note that when we assigned rn we ensured that the rows were in ascending time order
#so rn can be used as a proxy for sorting by time
DT2[class=='C'][DT2[class != 'C'], 
                on=.(rn > rn), 
                by=.EACHI,
                .(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))]
#>     rn rn nextC prevC
#>  1:  1  1     8    NA
#>  2:  2  2     8    NA
#>  3:  3  3     8    NA
#>  4:  4  4     8    NA
#>  5:  5  5     8    NA
#>  6:  6  6     8    NA
#>  7:  7  7     8    NA
#>  8:  9  9    13     8
#>  9: 10 10    13     8
#> 10: 11 11    13     8
#> 11: 12 12    13     8
#> 12: 14 14    16    13
#> 13: 15 15    16    13
#> 14: 17 17    26    16
#> 15: 18 18    26    16
#> 16: 19 19    26    16
#> 17: 20 20    26    16
#> 18: 21 21    26    16
#> 19: 22 22    26    16
#> 20: 23 23    26    16
#> 21: 24 24    26    16
#> 22: 25 25    26    16
#> 23: 28 28    30    27
#> 24: 29 29    30    27
#>     rn rn nextC prevC

#But I want to add this information as additional columns to the base table
DT2[DT2[class=='C'][DT2[class != 'C'], 
                on=.(rn > rn), 
                by=.EACHI,
                .(rn=i.rn, nextC = min(x.rn), prevC = min(x.prevOfClass))],
    on = .(rn),
    ':='(nextC=i.nextC, prevC = i.prevC)
]
print(DT2[,.(rn, date, class, prevOfClass, nextC, prevC)])
#>     rn       date class prevOfClass nextC prevC
#>  1:  1 2021-02-23     B          NA     8    NA
#>  2:  2 2021-02-24     B           1     8    NA
#>  3:  3 2021-02-25     A          NA     8    NA
#>  4:  4 2021-02-26     B           2     8    NA
#>  5:  5 2021-02-27     A           3     8    NA
#>  6:  6 2021-02-28     A           5     8    NA
#>  7:  7 2021-03-01     A           6     8    NA
#>  8:  8 2021-03-02     C          NA    NA    NA
#>  9:  9 2021-03-03     A           7    13     8
#> 10: 10 2021-03-04     A           9    13     8
#> 11: 11 2021-03-05     B           4    13     8
#> 12: 12 2021-03-06     A          10    13     8
#> 13: 13 2021-03-07     C           8    NA    NA
#> 14: 14 2021-03-08     A          12    16    13
#> 15: 15 2021-03-09     B          11    16    13
#> 16: 16 2021-03-10     C          13    NA    NA
#> 17: 17 2021-03-11     A          14    26    16
#> 18: 18 2021-03-12     B          15    26    16
#> 19: 19 2021-03-13     A          17    26    16
#> 20: 20 2021-03-14     B          18    26    16
#> 21: 21 2021-03-15     A          19    26    16
#> 22: 22 2021-03-16     A          21    26    16
#> 23: 23 2021-03-17     A          22    26    16
#> 24: 24 2021-03-18     A          23    26    16
#> 25: 25 2021-03-19     B          20    26    16
#> 26: 26 2021-03-20     C          16    NA    NA
#> 27: 27 2021-03-21     C          26    NA    NA
#> 28: 28 2021-03-22     B          25    30    27
#> 29: 29 2021-03-23     A          24    30    27
#> 30: 30 2021-03-24     C          27    NA    NA
#>     rn       date class prevOfClass nextC prevC

#What would be the best approach to this using dplyr / dtplyr syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and...
#There are many more updates and enrichments that need to be applied
#some of which add new columns, others will update just a few rows
#in a column
#So 'mutate in place/by reference' is highly desirable

Created on 2021-03-25 by the reprex package (v1.0.0)
