r - groupBy in data.table: using the first value

Question

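I have a huge data.table in R containing experiment results: for each result, the id of the run and a configuration parameter are stored in two additional columns. The conf parameter is constant within each run. See this simplified example: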

> x=data.table(runId=rep(c(1,2,3,4,5,6),each=5),conf=rep(c(10,10,500,500,1000,1000), each=5), value=runif(30,1, 1000))
> x
   runId conf     value
       1   10 102.17366
       1   10 739.31317
       1   10 361.83867
       1   10 915.05966
       1   10 435.11605
       2   10 254.13930
       2   10 482.93782
       2   10 598.34327
       2   10 401.45823
       2   10 480.17624
       3  500 831.03700
       3  500 378.53013
       3  500 371.75072
       3  500  61.27925
       3  500 425.50863
       4  500 557.64415
       4  500 731.07127
       4  500 836.31104
       4  500 138.61641
       4  500 106.12334
       5 1000 925.24886
       5 1000 840.06707
       5 1000 680.79559
       5 1000 402.77619
       5 1000 507.21966
       6 1000 111.93297
       6 1000 100.88960
       6 1000 149.17332
       6 1000 444.28845
       6 1000 654.86640

I want to compute the mean of the values for each run, which I can do with:

> x[,list(mean=mean(value)),by=runId]
    runId     mean
[1,]     1 634.1549
[2,]     2 275.1270
[3,]     3 328.4098
[4,]     4 584.1364
[5,]     5 616.1647
[6,]     6 411.2354

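I also want to add the conf value to each row of the aggregate. I can in fact get this by applying mean to the conf column. But that is wasted work, because the conf value never changes within a runId: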

> x[,list(conf=mean(conf),mean=mean(value)),by=runId]
     runId conf     mean
[1,]     1   10 634.1549
[2,]     2   10 275.1270
[3,]     3  500 328.4098
[4,]     4  500 584.1364
[5,]     5 1000 616.1647
[6,]     6 1000 411.23

Is there an alternative to this hacky use of mean? Something like a "first" function (or "last", it doesn't matter in this case) that I can use when aggregating?

score 1 · Accepted Answer

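OK, just as I finished writing this question, I got the answer on IRC. Since I have already posted the question, maybe someone will find this useful, even though the result is obvious: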

To get the first value of each group, just use column[1]. So the example above boils down to:

> x[,list(conf=conf[1], mean=mean(value)), by=runId]
     runId conf     mean
[1,]     1   10 634.1549
[2,]     2   10 275.1270
[3,]     3  500 328.4098
[4,]     4  500 584.1364
[5,]     5 1000 616.1647
[6,]     6 1000 411.23
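
For comparison, here is a minimal sketch of a few other ways to write the same aggregation. It assumes a reasonably recent data.table (for `uniqueN()` and the `.()` alias for `list()`). The seed, the integer `runId`, and the result names `a`, `b`, and `d` are illustrative choices, not part of the original example. All three variants rely on conf being constant within a run, which the sketch checks first.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the illustration is reproducible
x <- data.table(
  runId = rep(1:6, each = 5),
  conf  = rep(c(10, 10, 500, 500, 1000, 1000), each = 5),
  value = runif(30, 1, 1000)
)

# Sanity check: conf takes exactly one distinct value within each run.
stopifnot(x[, uniqueN(conf), by = runId][, all(V1 == 1)])

# Same idea as conf[1], spelled with data.table's first() helper.
a <- x[, .(conf = first(conf), mean = mean(value)), by = runId]

# Last element of each group; .N is the number of rows in the current group.
b <- x[, .(conf = conf[.N], mean = mean(value)), by = runId]

# Group by both columns, so conf is carried along as a grouping column.
d <- x[, .(mean = mean(value)), by = .(runId, conf)]
```

Grouping by both columns is probably the least surprising choice when conf really is constant per run. If it is not, that variant produces extra rows, whereas `conf[1]` silently keeps whichever value comes first.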
