r - data.table and table unexpected behavior

Question

The data comes from another question I was playing around with:

library(data.table)
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")
#    user country event
#1:     3       1     1
#2:     3       1     2
#3:     3       1     3
#4:     3       1     4
#5:     3       2     5
#6:     4       2     6
#7:     4       2     7
#8:     4       2     8
#9:     4       2     9
#10:    4       2    10

And here's the surprising behavior:

dt[user == 3, as.data.frame(table(country))]
#  country Freq
#1       1    4
#2       2    1

dt[user == 4, as.data.frame(table(country))]
#  country Freq
#1       2    5

dt[, as.data.frame(table(country)), by = user]
#   user country Freq
#1:    3       1    4
#2:    3       2    1
#3:    4       1    5
#             ^^^ - why is this 1 instead of 2?!

Thanks mnel and Victor K. The natural follow-up is - shouldn't it be 2, i.e. is this a bug? I expected

dt[, blah, by = user]

to return the same result as

rbind(dt[user == 3, blah], dt[user == 4, blah])

Is that expectation incorrect?

score 7 · Accepted Answer

The idiomatic data.table approach is to use .N

 dt[ , .N, by = list(user, country)]

This will be much faster, and it also keeps country in the same class as in the original data.
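
For reference, here is a minimal sketch of that call on the question's data, assuming the data.table package is installed. The name `counts` is only for illustration.

```r
library(data.table)  # assumption: data.table is installed

# Same data as in the question
dt <- data.table(user    = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event   = 1:10, key = "user")

# .N is the number of rows in each (user, country) group
counts <- dt[, .N, by = list(user, country)]
print(counts)
#    user country N
# 1:    3       1 4
# 2:    3       2 1
# 3:    4       2 5

# country was never converted to a factor, so it is still numeric
class(counts$country)  # "numeric"
```

Because no factor is created along the way, user 4 is reported with country 2, as the question expected.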

score 5 · Accepted Answer

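As mnel noted in the comments, as.data.frame(table(...)) produces a data frame whose first variable is a factor. For user == 4 the factor has only one level, which is stored internally as 1.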

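What you want is the factor levels, but what you get is how the factor is stored internally (as integers, starting from 1). The following gives the expected result: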

> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
   user country Freq
1:    3       1    4
2:    3       2    1
3:    4       2    5
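
The difference between a factor's levels and its internal codes can be seen in base R alone. This is a small sketch that rebuilds the user == 4 group by hand; `country4` and `tab` are illustrative names, not part of the original code.

```r
# The country values for user == 4 in the question's data
country4 <- rep(2, 5)

# as.data.frame(table(...)) turns the tabulated variable into a factor
tab <- as.data.frame(table(country = country4))
is.factor(tab$country)    # TRUE

# The only level is "2", but its internal integer code is 1
levels(tab$country)       # "2"
as.integer(tab$country)   # 1

# Converting through character recovers the original value
as.numeric(as.character(tab$country))  # 2
```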

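Update. Regarding your second question: no, I think the data.table behavior is correct. The same thing happens in plain R when you concatenate two factors with different levels: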

> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
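
Note that the c(a,b) output above is what R versions before 4.1.0 print; from R 4.1.0 on, c() applied to factors returns a factor with the combined levels. One way to combine factors that should not depend on the R version is to go through character first. A minimal sketch (`combined` is an illustrative name):

```r
a <- factor(3:5)
b <- factor(6:8)

# Convert each factor to its labels, concatenate, then rebuild the factor
combined <- factor(c(as.character(a), as.character(b)))

levels(combined)        # "3" "4" "5" "6" "7" "8"
as.character(combined)  # "3" "4" "5" "6" "7" "8"
```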
