r - Create a numeric variable in a data frame from the mean of another variable by a factor

Question

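I have what is probably a simple problem, although I am finding it hard to phrase (which may be the root of the problem).

Suppose I have the following example data: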

    V1 <- c(1,1,1,1,1,2,2,2,2,2)
    factor <- factor(V1)
    V2 <- c(1,2,3,4,5,6,7,8,9,10)
    V3 <- c(10,20,30,40,50,60,70,80,90,100)
    test <- data.frame(factor,V2,V3)

How can I generate another variable, say V4, that holds the mean of V3 for each level of the factor? I can get the means using, for example, tapply:

    tapply(test$V3, test$factor, FUN=mean)

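In this case that returns 30 and 80 respectively, but I want the result to be a variable in which each mean is repeated for every row of the corresponding factor level, like this: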

      factor V2  V3 V4
   1       1  1  10 30
   2       1  2  20 30
   3       1  3  30 30
   4       1  4  40 30
   5       1  5  50 30
   6       2  6  60 80
   7       2  7  70 80
   8       2  8  80 80
   9       2  9  90 80
   10      2 10 100 80

Any suggestions and solutions are welcome, as is advice on how to phrase the question better.

score 4 · Accepted Answer

Use ave instead of tapply:

    within(test, {
      V4 <- ave(V3, factor, FUN = mean)
    })
       factor V2  V3 V4
    1       1  1  10 30
    2       1  2  20 30
    3       1  3  30 30
    4       1  4  40 30
    5       1  5  50 30
    6       2  6  60 80
    7       2  7  70 80
    8       2  8  80 80
    9       2  9  90 80
    10      2 10 100 80

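The construct is very similar to how you were using tapply. I used within for two reasons: (1) to save some typing, and (2) to let us create the new column automatically. Note that within returns a modified copy rather than changing test in place, so assign the result back to test if you want to keep V4.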
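To make the ave approach concrete, here is a minimal base R sketch that rebuilds the question's data and stores the new column. The column name V4_direct is introduced here purely for illustration and is not part of the original answer.

```r
# Rebuild the example data from the question.
test <- data.frame(
  factor = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)),
  V2 = 1:10,
  V3 = seq(10, 100, by = 10)
)

# within() returns a modified copy, so assign it back to keep V4.
test <- within(test, {
  V4 <- ave(V3, factor, FUN = mean)
})

# Equivalent direct assignment; mean is already the default FUN of ave().
# V4_direct is a hypothetical column name used only for comparison.
test$V4_direct <- ave(test$V3, test$factor)

# Compare the two columns; both repeat the group mean on every row.
same_result <- identical(test$V4, test$V4_direct)
print(test)
```

Unlike tapply, which returns one value per group, ave returns a vector as long as its input, which is why it can be assigned straight into the data frame.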

The data.table package offers some very convenient syntax for these kinds of operations:

    > library(data.table)
    data.table 1.8.8  For help type: help("data.table")
    > DT <- data.table(test)
    > DT[, V4 := mean(V3), by = factor]
    > DT
        factor V2  V3 V4
     1:      1  1  10 30
     2:      1  2  20 30
     3:      1  3  30 30
     4:      1  4  40 30
     5:      1  5  50 30
     6:      2  6  60 80
     7:      2  7  70 80
     8:      2  8  80 80
     9:      2  9  90 80
    10:      2 10 100 80

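Not to overwhelm the reader, but there are many ways to do this. Here are two more solutions in base R (although both are much less efficient than the alternatives already shared).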

aggregate

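    # Compute the per-level means, rename the result column to V4,
    # then merge the means back onto the original rows.
    merge(test,
          setNames(aggregate(V3 ~ factor, test, mean),
                   c("factor", "V4")), all = TRUE)

Making use of your tapply output:

    # Turn the tapply result into a one-column data frame whose
    # row names are the factor levels, then merge on those row names.
    temp <- tapply(test$V3, test$factor, FUN=mean)
    temp <- data.frame(V4 = temp)
    merge(test, temp, by.x = "factor", by.y = "row.names", all = TRUE)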
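As a further base R variant of the tapply idea, the means can be looked up by each row's factor level instead of being merged. This is only a sketch and not part of the original answer; group_means is a name introduced here for illustration.

```r
# Rebuild the example data from the question.
test <- data.frame(
  factor = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)),
  V2 = 1:10,
  V3 = seq(10, 100, by = 10)
)

# One mean per factor level, named by level ("1" and "2").
group_means <- tapply(test$V3, test$factor, FUN = mean)

# Look up each row's mean by its level name; as.vector() drops the
# array attributes so that V4 is stored as a plain numeric column.
test$V4 <- as.vector(group_means[as.character(test$factor)])

print(test)
```

Because this assigns by position, the rows of test stay in their original order, whereas merge builds a new data frame.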

score 1 · Accepted Answer

Here is a solution with plyr:

    R> library(plyr)
    R> ddply(test, .(factor), transform, V4=mean(V3))
       factor V2  V3 V4
    1       1  1  10 30
    2       1  2  20 30
    3       1  3  30 30
    4       1  4  40 30
    5       1  5  50 30
    6       2  6  60 80
    7       2  7  70 80
    8       2  8  80 80
    9       2  9  90 80
    10      2 10 100 80
