I have a data.table with columns a and b that I have partitioned into below (rows with b < .5) and above (rows with b > .5):
library(data.table)
DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]
I would like to do a left outer join between above and below: for each a in above, count the number of matching rows in below. This is equivalent to the following in SQL:
with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
above as (select a, b from dt where b > .5),
below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
a | count
---+-------
3 | 0
2 | 1
(2 rows)
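To make the target concrete, here is a slow base R sketch that produces the counts I am after on the toy data. It is not a data.table solution and would not scale to my real dataset; the names above_a, below_a, keys and expected are just placeholders for this illustration.

```r
# Toy data copied from the example above (column a only).
above_a <- c(2L, 3L, 3L)   # a values of the rows with b > .5
below_a <- c(1L, 1L, 2L)   # a values of the rows with b < .5

# For each distinct a in above, count the rows of below with the same a.
keys <- unique(above_a)
expected <- data.frame(
  a = keys,
  count = vapply(keys, function(k) sum(below_a == k), integer(1))
)
print(expected)
```

This gives a count of 1 for a = 2 and 0 for a = 3, which matches the SQL result.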
How do I accomplish the same thing with data.tables? This is what I have tried so far:
> key(below) = 'a'
> below[above, list(count=length(b))]
a count
[1,] 2 1
[2,] 3 1
[3,] 3 1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
a count b
[1,] 2 1 1
[2,] 3 NA 1
[3,] 3 NA 1
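I should also be more specific: I have already tried merge, but that exhausts the memory on my system, even though the dataset itself takes up only about 20% of my memory.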