Let's say we have the following:

time=c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400)
bucket=c(1,1,2,2,1,3,3,3,1)
rate=c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)




       time bucket rate
1: 20060200      1 0.05
2: 20060200      1 0.04
3: 20060200      2 0.04
4: 20060200      2 0.05
5: 20060200      1 0.06
6: 20060300      3 0.01
7: 20060400      3 0.07
8: 20060400      3 0.08
9: 20060400      1 0.03

I know how to aggregate the rate by time or by bucket with something like this:

library(data.table)
test=data.table(time,bucket,rate)
b=test[,list(x=sum(rate)),by=bucket]

My question is how to aggregate by bucket while keeping time intact, so that every time/bucket combination appears and combinations with no data get 0.
What I want is something like this:

20060200  1  0.15
20060200  2  0.09
20060200  3  0
20060300  1  0
20060300  2  0
20060300  3  0.01 
20060400  1  0.03
20060400  2  0
20060400  3  0.15

Hope this is clear, thanks.

2 Answers

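As @Mittenchops said, you are looking for a Cartesian product. There is a function for that, CJ. You can get the combinations you want with unique(CJ(time,bucket)). To use it with your data.table, you can (i) set the key and (ii) join it with CJ: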

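setkey(test,time,bucket)
# by=.EACHI sums once per row of the CJ table; data.table 1.9.4 and later need it spelled out
b <- test[unique(CJ(time,bucket)),list(x=sum(rate)),by=.EACHI]
b[is.na(x),x:=0]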

The last step sets the missing values to 0. The result is:

       time bucket    x
1: 20060200      1 0.15
2: 20060200      2 0.09
3: 20060200      3 0.00
4: 20060300      1 0.00
5: 20060300      2 0.00
6: 20060300      3 0.01
7: 20060400      1 0.03
8: 20060400      2 0.00
9: 20060400      3 0.15

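By the way, when you "join" using the x[y,...] syntax (where x and y are both data.tables), there is a hidden by on x's key (possibly only its first part), a "by-without-by". Look up "by-without-by" in the documentation or on Google for details. (In data.table 1.9.4 and later, this per-row grouping has to be requested explicitly with by=.EACHI.)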
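
For readers who would rather not depend on the join's grouping behaviour at all, here is a minimal sketch of the same idea in two explicit steps: aggregate first, then join the result onto the CJ grid. It assumes data.table 1.9.6 or later (for the on= argument), and the names agg and grid are only illustrative, not part of the original answer.

```r
library(data.table)

# Same sample data as in the question
test <- data.table(
  time   = c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400),
  bucket = c(1,1,2,2,1,3,3,3,1),
  rate   = c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)
)

# Step 1: sum rate for the (time, bucket) pairs that actually occur
agg <- test[, .(x = sum(rate)), by = .(time, bucket)]

# Step 2: build every time/bucket combination (CJ returns it sorted)
grid <- CJ(time = unique(test$time), bucket = unique(test$bucket))

# Step 3: join the sums onto the grid; combinations with no data come back as NA
b <- agg[grid, on = .(time, bucket)]

# Step 4: turn the missing sums into 0
b[is.na(x), x := 0]
```

Printing b should give the nine rows shown above, ordered by time and then bucket.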

answered 2013-08-27T17:48:49.427

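It sounds like what makes your problem hard is not the aggregation itself, but building the Cartesian product of times by groups to fill in the gaps the aggregation leaves. It would be nice if the function had a flag to do that, but there doesn't seem to be one.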

So, this isn't elegant, but here is a solution that builds that structure and then grafts the aggregated results onto that scaffold:

> df <- aggregate(rate~., data=test, sum)
> df
      time bucket rate
1 20060200      1 0.15
2 20060400      1 0.03
3 20060200      2 0.09
4 20060300      3 0.01
5 20060400      3 0.15

Work out which levels we need to build our Cartesian scaffold, in this case all times for all groups:

> levels(factor(bucket))
[1] "1" "2" "3"
> levels(factor(time))
[1] "20060200" "20060300" "20060400"
> B <- levels(factor(bucket))
> t <- levels(factor(time))

Build a grid to serve as the base that the results are grafted onto:

> base <- expand.grid(B,t)
> names(base) <-c("bucket","time")
> base
  bucket     time
1      1 20060200
2      2 20060200
3      3 20060200
4      1 20060300
5      2 20060300
6      3 20060300
7      1 20060400
8      2 20060400
9      3 20060400

Merge the data frame onto the base:

> m <- merge(base,df,all.x=TRUE)
> m
  bucket     time rate
1      1 20060200 0.15
2      1 20060300   NA
3      1 20060400 0.03
4      2 20060200 0.09
5      2 20060300   NA
6      2 20060400   NA
7      3 20060200   NA
8      3 20060300 0.01
9      3 20060400 0.15

Replace the NAs with 0:

> m$rate[is.na(m$rate)] <- 0
> m
  bucket     time rate
1      1 20060200 0.15
2      1 20060300 0.00
3      1 20060400 0.03
4      2 20060200 0.09
5      2 20060300 0.00
6      2 20060400 0.00
7      3 20060200 0.00
8      3 20060300 0.01
9      3 20060400 0.15

Sort to get the desired output:

> m[with(m,order(time,bucket)),]
  bucket     time rate
1      1 20060200 0.15
4      2 20060200 0.09
7      3 20060200 0.00
2      1 20060300 0.00
5      2 20060300 0.00
8      3 20060300 0.01
3      1 20060400 0.03
6      2 20060400 0.00
9      3 20060400 0.15
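
The steps above can be collected into one script. This is only a sketch of the same scaffold-and-merge idea, with two assumptions that differ from the transcript: test is rebuilt here as a plain data.frame so the block runs on its own, and the grid is built from sort(unique(...)) with named arguments so that bucket and time stay numeric instead of becoming factor levels.

```r
# Sample data from the question, as a base R data.frame
test <- data.frame(
  time   = c(20060200,20060200,20060200,20060200,20060200,20060300,20060400,20060400,20060400),
  bucket = c(1,1,2,2,1,3,3,3,1),
  rate   = c(0.05,0.04,0.04,0.05,0.06,0.01,0.07,0.08,0.03)
)

# Sum rate for the time/bucket pairs that occur in the data
df <- aggregate(rate ~ time + bucket, data = test, FUN = sum)

# Scaffold: every bucket crossed with every time
base <- expand.grid(bucket = sort(unique(test$bucket)),
                    time   = sort(unique(test$time)))

# Left join the sums onto the scaffold; missing pairs get NA
m <- merge(base, df, all.x = TRUE)

# Replace the NAs with 0 and sort by time, then bucket
m$rate[is.na(m$rate)] <- 0
m <- m[order(m$time, m$bucket), ]
```

Keeping the keys numeric means the merge compares like with like, and the final order() sorts numerically rather than by factor level.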
answered 2013-08-27T17:24:28.680