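The exercise is to aggregate a numeric vector of values by a combination of factors, using data.table in R. Take the following data table as an example: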

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3], each = 3),
                                       fac = letters[1:3]),
                          value = rnorm (27)))

Note that each unique combination of 'month' and 'fac' appears 3 times. So when I average the values by these two factors, I expect a data frame with 9 unique rows:

(agg1 <- ddply (dtb, c ("month", "fac"), function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when I aggregate with data.table, I keep getting a result for every redundant combination of the two factors:

(agg2 <- dtb[, value := mean (value), by = list (month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way with data.table to collapse these results into one row per unique combination of the factors?

2 Answers

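The issue (and the reasoning behind it) is that you are assigning the aggregated value rather than just computing it.

This is easier to see if you look at a data.table that has more columns than just the ones used in the calculation.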

# Therefore, let's add a new column
dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we only want to output the computed value, the RHS expression you have is fine.

# This gives the expected results
dtb[, mean (value), by = list (month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean (value), by = list (month, fac)]

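In other words, when you only compute, the result contains one row per group.
However, if you want to save this value back into the SAME data.table (which is what happens when you use the `:=` operator), then every row identified in `i` (by default, all rows) is assigned a value. (This makes sense when you look at the output with the additional column.)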

Assigning this data.table to `agg2` afterwards still carries over all of the rows.
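
The difference can be checked by counting rows. What follows is a minimal sketch, assuming a freshly built table with the same shape as the one in the question; the names `dt_demo`, `computed`, and `assigned` are made up for illustration and do not appear in the original post.

```r
library(data.table)

# Hypothetical demo table: 3 months x 3 factor levels, 3 rows per combination
set.seed(1)
dt_demo <- data.table(
  month = rep(rep(month.abb[1:3], each = 3), times = 3),
  fac   = rep(letters[1:3], each = 9),
  value = rnorm(27)
)

# Computing only: j returns one row per (month, fac) group
computed <- dt_demo[, list(value = mean(value)), by = list(month, fac)]
nrow(computed)   # 9

# Assigning with := writes the group mean into every row of the table
# (copy() is used here so that dt_demo itself stays untouched)
assigned <- copy(dt_demo)[, value := mean(value), by = list(month, fac)]
nrow(assigned)   # 27
```

The `copy()` call is a design choice for the sketch only: `:=` modifies a data.table by reference, so without it the original values in `dt_demo` would be overwritten.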

Therefore, if you want to copy only the unique rows of the original table into a new table, you can

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table, above, that is returned when you 
    are not assigning the RHS output (which is what @Arun suggested)

An example of a. is:

 agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])

The following examples may help illustrate this.

(You will need to copy and paste this yourself, since the output is omitted.)

  # SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table (expand.grid (month = rep (month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm (27))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.


  dtb[, value := mean (value), by = list (month, fac)]
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is column name
  dtb[, mean (value), by = list (month, fac)]
  dtb[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]


  dtb1 <- copy(dtb.bak)  # restore, from sample data.
  dtb2 <- copy(dtb.bak)  # restore, from sample data.


  # Method 1
  dtb1[, value := mean (value), by = list (month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean (value), newCol), by = list (month, fac)]  # quote marks added for clarity
  # notice this has more rows than 
  unique(dtb1)
answered 2013-03-05T21:37:02.050

You should do:

agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)]

`:=` will recycle the values of the RHS to fit the LHS. See ?':=' to read more about this.
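
To see the recycling on a small scale, here is a short sketch. The table `dt_small` and the column `grp_mean` are illustrative names, not taken from the question.

```r
library(data.table)

# Illustrative table: two groups of three rows each
dt_small <- data.table(g = rep(c("x", "y"), each = 3),
                       value = c(1, 2, 3, 10, 20, 30))

# mean(value) has length 1 per group; := recycles it over the group's rows
dt_small[, grp_mean := mean(value), by = g]
dt_small$grp_mean   # 2 2 2 20 20 20

# list(...) in j returns the length-1 result as is: one row per group
dt_small[, list(grp_mean = mean(value)), by = g]
```

The row count of `dt_small` stays at 6 after the `:=` call, while the `list()` form returns a new 2-row table and leaves `dt_small` as it was.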

answered 2013-03-05T19:08:39.783