Suppose I have the following data.table:

library(data.table)

dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange", "apple", "apple", "tomato", "tomato"),
                 key = "id")

   id sex  fruit
1:  1   H  apple
2:  1   H tomato
3:  1   H  apple
4:  1   H  apple
5:  1   H orange
6:  2   F  apple
7:  2   F  apple
8:  2   F tomato
9:  2   F tomato

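Each row represents the fact that someone (identified by id and sex) ate a fruit. I want to count the number of times each fruit has been eaten, by sex. I can do it with: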

dt[ , .N, by = c("fruit", "sex")]

Which gives:

    fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2

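The problem is that, done this way, I lose the count of orange for sex == "F", because that count is 0. Is there a way to do this aggregation without losing the zero-count combinations?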

To be perfectly clear, the desired result is the following:

   fruit sex N
1:  apple   H 3
2: tomato   H 1
3: orange   H 1
4:  apple   F 2
5: tomato   F 2
6: orange   F 0

Thanks a lot!

2 Answers

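The most straightforward approach seems to be to explicitly supply all category combinations in a data.table passed as i=, and to set by=.EACHI so that the aggregation iterates over each of them: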

setkey(dt, sex, fruit)
dt[CJ(sex, fruit, unique = TRUE), .N, by = .EACHI]
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1
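
The same join can also be written without setting a key first. The following is a minimal sketch, assuming a data.table version that supports the on= argument (1.9.6 or later); the names combos and counts are only illustrative. It names the CJ() columns explicitly so they match the columns given to on=.

```r
library(data.table)

# Same data as in the question, but without setting a key.
dt <- data.table(id = c(rep(1, 5), rep(2, 4)),
                 sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange",
                           "apple", "apple", "tomato", "tomato"))

# Every sex/fruit combination; the columns are named explicitly so they
# line up with the columns listed in on=.
combos <- CJ(sex = dt$sex, fruit = dt$fruit, unique = TRUE)

# Join on the named columns; .N is 0 for combinations with no matching row.
counts <- dt[combos, on = .(sex, fruit), .N, by = .EACHI]
print(counts)
```

Because the join columns are stated in on=, dt keeps its original row order and key.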

answered 2013-05-14T15:40:13.677

One way is to convert sex to a factor (id seems to be redundant here?):

dt[, sex := factor(sex)]
dt[, .(sex=levels(sex), N=c(table(sex))), by=fruit]
#     fruit sex N
# 1:  apple   F 2
# 2:  apple   H 3
# 3: tomato   F 2
# 4: tomato   H 1
# 5: orange   F 0
# 6: orange   H 1

Or you can convert fruit to a factor and group by sex:

dt[, fruit := factor(fruit)]
dt[, .(fruit = levels(fruit), N=c(table(fruit))),by=sex]
#    sex  fruit N
# 1:   H  apple 3
# 2:   H orange 1
# 3:   H tomato 1
# 4:   F  apple 2
# 5:   F orange 0
# 6:   F tomato 2
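
Both variants work because table() reports a count for every level, including levels that never occur within a group. As a rough sketch of the same idea in a single call (not part of the answer above; tab and counts_tab are illustrative names), a two-way table() can be converted with data.table's as.data.table method for table objects:

```r
library(data.table)

dt <- data.table(sex = c(rep("H", 5), rep("F", 4)),
                 fruit = c("apple", "tomato", "apple", "apple", "orange",
                           "apple", "apple", "tomato", "tomato"))

# Two-way contingency table: every fruit/sex cell is present, zeros included.
tab <- table(fruit = dt$fruit, sex = dt$sex)

# Long format with the columns fruit, sex and N.
counts_tab <- as.data.table(tab)
print(counts_tab)
```

This builds the full cross-tabulation in memory, so the same caution about large inputs applies.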

Edit:

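But I suspect that relying on table may not be a good idea if your data.table is huge. In that case, using CJ as in your previous question is probably the way to go. That is, aggregate first, then join.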

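# Aggregate first (keyby keeps the result keyed by sex, fruit),
# then join against every combination and turn the unmatched NA counts into 0.
out <- dt[, .N, keyby = .(sex, fruit)][
  CJ(sex = c("H", "F"), fruit = c("apple", "tomato", "orange"))
][is.na(N), N := 0L]
out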
#    sex  fruit N
# 1:   F  apple 2
# 2:   F orange 0
# 3:   F tomato 2
# 4:   H  apple 3
# 5:   H orange 1
# 6:   H tomato 1

answered 2013-05-13T10:15:18.780