r - Comparing the mean with a specific data value in R

Question

I have a data frame like this:

Date     Process Duration
1/1/2012 xnit     10
1/1/2012 xnit     15
1/1/2012 xnit     20
1/2/2012 telnet   80
1/2/2012 telnet   50
1/2/2012 telnet   40
8/1/2012 ftp      3
8/1/2012 ftp      11
8/1/2012 ftp      12

After converting it with x<-data.table(x), I can calculate the mean for each process like this:

x<-x[, mean := mean(Duration), by = Process]

I would like to compare the Duration on a specific date against the mean. I tried this:

x<-x[, Aug1 := subset(x, Date==as.Date(c("2012-08-01")))$Duration, by = Process]

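Once I have this value, I will compare the Aug1 column with the mean for each process to look for outliers. However, this command takes a long time to complete. Is there a better way?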

score 2 · Accepted Answer

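There is no need to reassign to x when using :=, because it assigns to x by reference (especially from version 1.8.3 onward, where := no longer prints by default). I also would not use subset or $ with data.tables, as this bypasses all of the data.table efficiencies.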
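As a minimal sketch of what assigning by reference means in practice: the table name dt and its values below are made up for illustration and are not the asker's data. The point is only that the new column appears on dt without any `dt <- ...`.

```r
library(data.table)

# Toy data (hypothetical values, for illustration only)
dt <- data.table(Process  = c("ftp", "ftp", "telnet"),
                 Duration = c(3, 11, 80))

# No reassignment: := adds the column to dt in place
dt[, mean_duration := mean(Duration), by = Process]

# The column is now part of dt itself
names(dt)
```

Because the update happens in place, wrapping the call in `x <- x[...]` adds nothing.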

Try something like this:

 x <- data.table(x)
 # add a column that is the by-process mean
 x[, mean_duration := mean(Duration), by = Process]

 # calculate the difference
 x[, diff_duration := Duration - mean_duration]

 # subset just the 1st of August (this comparison assumes Date is of class Date)
 x[Date==as.Date("2012-08-01")]

This final subset could be done more efficiently if the data.table was keyed by Date. In the current form this final step is a vector scan, but a single vector scan should not be too inefficient.
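A rough sketch of the keyed variant, using the sample rows from the question. It assumes the Date column arrives as month/day/year text (as the sample suggests) and so converts it first; the name aug1 is only illustrative.

```r
library(data.table)

# Sample rows from the question; Date is assumed to be month/day/year text
x <- data.table(
  Date     = c("1/1/2012", "1/1/2012", "1/1/2012",
               "1/2/2012", "1/2/2012", "1/2/2012",
               "8/1/2012", "8/1/2012", "8/1/2012"),
  Process  = rep(c("xnit", "telnet", "ftp"), each = 3),
  Duration = c(10, 15, 20, 80, 50, 40, 3, 11, 12)
)

# Convert the text dates to Date class so they compare correctly
x[, Date := as.Date(Date, format = "%m/%d/%Y")]

# By-process mean and the difference from it, as in the code above
x[, mean_duration := mean(Duration), by = Process]
x[, diff_duration := Duration - mean_duration]

# Key by Date: sorts x by Date and enables binary-search subsetting
setkey(x, Date)

# Keyed subset: looks up the key value instead of scanning the whole column
aug1 <- x[.(as.Date("2012-08-01"))]
aug1
```

On nine rows the difference is invisible; the key only pays off when the table is large or the lookup is repeated many times.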

I would recommend reading the introduction vignette to better utilize the data.table syntax and efficiency.
