7

我有这个问题,data.table最近让我发疯。它看起来像一个错误,但可能是我在这里遗漏了一些明显的东西。

我有以下数据框:

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

我想用portal和来总结一下month。按预期总结好的旧tapply作品 - 我得到 3x2 矩阵,其中包含 2012 年 12 月和 2013 年 1 月至 2 月的结果:

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

总结与by论点data.table不:

> data[, mean(satisfaction), by = 'month,portal']
   month      portal        V1
1: 2013-01-01  FALSE 10.000000
2: 2013-02-01   TRUE  9.000000
3: 2013-01-01   TRUE 10.000000
4: 2012-12-01  FALSE  8.500000
5: 2012-12-01   TRUE  7.333333
6: 2013-02-01   TRUE  9.666667
7: 2013-02-01  FALSE  9.000000
8: 2012-12-01   TRUE 10.000000

如您所见,它返回一个包含8个值的数据表,而不是预期的6 个;例如,重复portal == TRUE和的值。month == 2012-02-01

有趣的是,如果我将其限制为 2013 年的数据,一切都会按预期进行:

> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
        month portal        V1
1: 2013-01-01   TRUE 10.000000
2: 2013-01-01  FALSE 10.000000
3: 2013-02-01   TRUE  9.545455
4: 2013-02-01  FALSE  9.000000

我很困惑,难以置信:)。有人可以帮我吗?

4

2 回答 2

8

这是一个在 data.table 1.8.7 中解决的已知问题(在撰写本文时还没有在 CRAN 中)。

来自 data.table新闻

BUG FIXES

    <...>

o   setkey could sort 'double' columns (such as POSIXct) incorrectly when not the
    last column of the key, #2484. In data.table's C code :
        x[a] > x[b]-tol
    should have been :
        x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
    The difference may have been machine/compiler dependent. Many thanks to statquant
    for the short reproducible example. Test added.

使用 更新到 1.8.7 后install.packages("data.table", repos="http://R-Forge.R-project.org"),一切正常。

于 2013-02-25T22:42:50.587 回答
5

问题似乎与排序有关。当我加载data并执行以下操作时setkey

setkey(data, "month", "portal")

# > data
#          month portal satisfaction
#  1: 2012-12-01   TRUE           10
#  2: 2012-12-01  FALSE            9
#  3: 2012-12-01  FALSE            8
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01   TRUE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01  FALSE           10
# 14: 2013-02-01   TRUE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01  FALSE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

您会看到该portal列未正确排序。当我setkey再次这样做时,

setkey(data, "month", "portal")

# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. 
  If you didn't go under the hood please let datatable-help know so 
  the root cause can be fixed.

现在,这些data列似乎按关键列正确排序:

# > data
#          month portal satisfaction
#  1: 2012-12-01  FALSE            9
#  2: 2012-12-01  FALSE            8
#  3: 2012-12-01   TRUE           10
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01  FALSE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01   TRUE           10
# 14: 2013-02-01  FALSE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01   TRUE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

那么,排序类型似乎是一个问题POSIXct + logical

data[, mean(satisfaction), by=list(month, portal)]

#         month portal        V1
# 1: 2012-12-01  FALSE  8.500000
# 2: 2012-12-01   TRUE  8.000000
# 3: 2013-01-01  FALSE 10.000000
# 4: 2013-01-01   TRUE 10.000000
# 5: 2013-02-01  FALSE  9.000000
# 6: 2013-02-01   TRUE  9.545455

因此,我认为有一个错误。

于 2013-02-25T22:09:00.020 回答