
I have a data frame like this:

Date     Process Duration
1/1/2012 xnit     10
1/1/2012 xnit     15
1/1/2012 xnit     20
1/2/2012 telnet   80
1/2/2012 telnet   50
1/2/2012 telnet   40
8/1/2012 ftp      3
8/1/2012 ftp      11
8/1/2012 ftp      12

After converting it with x <- data.table(x):

I can compute the mean for each process like this:

x<-x[, mean := mean(Duration), by = Process]

I would like to compare the Duration on a specific date against the mean. I tried this:

x<-x[, Aug1 := subset(x, Date==as.Date(c("2012-08-01")))$Duration, by = Process]

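Once I have this value, I will compare the Aug1 column against each process's mean to spot outliers. However, this command takes a very long time to finish. Is there a better way?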


1 Answer


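You don't need to reassign to x when using :=, because it assigns to x by reference (especially since version 1.8.3, where := no longer prints by default). I also wouldn't use subset or $ with data.tables, as doing so bypasses all of data.table's efficiencies.

Try something like this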
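To illustrate the by-reference behaviour, here is a minimal sketch. The three-row table is a made-up stand-in for the real data, and the column name mean_duration is an illustrative choice.

```r
library(data.table)

# Hypothetical miniature of the question's data
x <- data.table(
  Process  = c("ftp", "ftp", "xnit"),
  Duration = c(3, 11, 10)
)

# No `x <-` on the left: := adds the column to x in place
x[, mean_duration := mean(Duration), by = Process]

# The new column is already part of x
names(x)
```

Because nothing is copied, this pattern stays cheap even when x is large.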

 x <- data.table(x)
 # add a column that is the by-process mean
 x[, mean_duration := mean(Duration), by = Process]

 # calculate the difference
 x[, diff_duration := Duration - mean_duration]

 # subset just the 1st of August (assumes Date is of class Date)
 x[Date == as.Date("2012-08-01")]

This final subset could be done more efficiently if the data.table was keyed by Date. In the current form this final step is a vector scan, but a single vector scan should not be too inefficient.
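A keyed version of that last step might look like the sketch below. It rebuilds the sample data from the question and assumes the dates are month/day/year text that has to be parsed into class Date first; the name aug1 is illustrative only.

```r
library(data.table)

# Sample data from the question; the m/d/Y input format is an assumption
x <- data.table(
  Date     = as.Date(c("1/1/2012", "1/1/2012", "1/1/2012",
                       "1/2/2012", "1/2/2012", "1/2/2012",
                       "8/1/2012", "8/1/2012", "8/1/2012"),
                     format = "%m/%d/%Y"),
  Process  = rep(c("xnit", "telnet", "ftp"), each = 3),
  Duration = c(10, 15, 20, 80, 50, 40, 3, 11, 12)
)

# By-process mean and each row's difference from it, both added by reference
x[, mean_duration := mean(Duration), by = Process]
x[, diff_duration := Duration - mean_duration]

# Sort x by Date in place and mark Date as the key
setkey(x, Date)

# Keyed lookup: binary search on Date instead of a vector scan
aug1 <- x[.(as.Date("2012-08-01"))]
aug1
```

On a table this small the difference is not measurable; the key should pay off mainly when the same table is subset by Date repeatedly.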

I would recommend reading the introduction vignette to better utilize the data.table syntax and efficiency.

Answered 2012-10-27T03:25:08.697