r - Sum columns based on multiple other factor columns

Question

I have the following data frame:

df<-structure(list(totprivland = c(175L, 50L, 100L, 14L, 4L, 240L, 
10L, 20L, 20L, 58L), ncushr8d1 = c(0L, 0L, 0L, 0L, 0L, 30L, 5L, 
0L, 0L, 50L), ncu_CENREG1 = structure(c(4L, 4L, 4L, 4L, 1L, 3L, 
3L, 3L, 4L, 4L), .Label = c("Northeast", "Midwest", "South", 
"West"), class = "factor"), ncushr8d2 = c(75L, 50L, 100L, 14L, 
2L, 30L, 5L, 20L, 20L, 8L), ncu_CENREG2 = structure(c(4L, 4L, 
4L, 4L, 1L, 2L, 1L, 4L, 3L, 4L), .Label = c("Northeast", "Midwest", 
"South", "West"), class = "factor"), ncushr8d3 = c(100L, NA, 
NA, NA, 2L, 180L, 0L, NA, NA, NA), ncu_CENREG3 = structure(c(4L, 
NA, NA, NA, 1L, 1L, 3L, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor"), ncushr8d4 = c(NA, 
NA, NA, NA, 0L, NA, NA, NA, NA, NA), ncu_CENREG4 = structure(c(NA, 
NA, NA, NA, 1L, NA, NA, NA, NA, NA), .Label = c("Northeast", 
"Midwest", "South", "West"), class = "factor")), .Names = c("totprivland", 
"ncushr8d1", "ncu_CENREG1", "ncushr8d2", "ncu_CENREG2", "ncushr8d3", 
"ncu_CENREG3", "ncushr8d4", "ncu_CENREG4"), row.names = c(27404L, 
27525L, 27576L, 27822L, 28099L, 28238L, 28306L, 28312L, 28348L, 
28379L), class = "data.frame")

=======

That is the dput output. Here is the basic idea:

Total    VariableA  LocationA    VariableB     LocationB
30            20       East          10         East
20            20       South         NA         West
115           15       East         100         South
100           50       West          50         West 
35            10       East          25         South

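Total (totprivland in the dput example) is the sum of the variables (ncushr8d1, ncushr8d2, ncushr8d3 and ncushr8d4), and each variable has a corresponding factor location variable (ncu_CENREG1, etc.). There are 6 additional variables and locations that follow this same pattern. The location is often the same for several of the numeric variables (e.g. multiple "East" location values, as in the first row of the example).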

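For each row, I would like to sum the values by common location factor and create a new column holding the total for each location. It would look something like this, and NA values can be ignored: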

Total    VariableA  LocationA    VariableB     LocationB   TotalWest  TotalEast TotalSouth
30            20       East          10         East          0          30          0
20            20       South         NA         NA            0           0         20
115           15       East         100         South         0          15        100
100           50       West          50         West        100           0          0 
35            10       East          25         South         0          10         25

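I have looked into aggregate and split but can't figure out how to get them to work across this many columns. I am also considering a lengthy "if" statement that would go through all 8 variables and their corresponding locations, but I feel there must be a better solution. The observations are weighted for use in the survey package, and I would like to avoid duplicating observations by making them "long" with the reshape package, although perhaps I could recombine them later. Any suggestions are appreciated!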

Many thanks, Luke

score 0 · Accepted Answer

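First, I would convert the data frame to a long format with 3 columns: value, location, case. The case column should indicate which case (e.g. row) the data came from. The order doesn't matter. So your data frame would look like: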

Value    Loc    Case
20       East   1
20       South  2
...
10       East   1

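And so on. One way to do this is to stack your values and your locations, then add the case numbers manually (which is easy). Suppose your original data frame is called df, with the values in columns 2 and 4 and the locations in columns 3 and 5: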

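# Stack the value columns into a single vector
v.col = stack(df[,c(2,4)])[,1]
# stack() only accepts vector columns, not factors, so flatten the locations via as.character()
v.loc = unlist(lapply(df[,c(3,5)], as.character), use.names = FALSE)
# Source row of each stacked element, repeated once per value/location pair
v.case = rep(1:nrow(df),2)
long.data = data.frame(v.col,v.loc,v.case)    # this is not actually needed, but just so you can view it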
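With more value/location pairs, picking the columns by position becomes error-prone. The following is a minimal sketch of selecting them by name instead. The tiny data frame is a hypothetical excerpt (two rows, three pairs) built only for illustration, and the sketch assumes the value and location columns appear in the same paired order in the data frame.

```r
# Hypothetical miniature of the question's data: two rows, three value/location pairs
regions <- c("Northeast", "Midwest", "South", "West")
df <- data.frame(
  totprivland = c(175L, 240L),
  ncushr8d1   = c(0L, 30L),
  ncu_CENREG1 = factor(c("West", "South"), levels = regions),
  ncushr8d2   = c(75L, 30L),
  ncu_CENREG2 = factor(c("West", "Midwest"), levels = regions),
  ncushr8d3   = c(100L, 180L),
  ncu_CENREG3 = factor(c("West", "Northeast"), levels = regions)
)

# Locate the paired columns by name prefix rather than by position
val.cols <- grep("^ncushr8d", names(df), value = TRUE)
loc.cols <- grep("^ncu_CENREG", names(df), value = TRUE)
stopifnot(length(val.cols) == length(loc.cols))

# Flatten column by column, so element i of each vector refers to the same cell
v.col  <- unlist(df[val.cols], use.names = FALSE)
v.loc  <- factor(unlist(lapply(df[loc.cols], as.character), use.names = FALSE),
                 levels = regions)
v.case <- rep(seq_len(nrow(df)), times = length(val.cols))

long.data <- data.frame(v.col, v.loc, v.case)
print(long.data)
```

Keeping v.loc as a factor with all four region levels means a later tapply call would produce a column for every region, including regions that never occur in the data.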

Now use tapply to create the columns you need:

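# Sum the values within each (case, location) cell; one column per location
s = tapply(v.col,list(v.case,v.loc),sum,na.rm=TRUE)
# Append the per-location totals to the original data frame
new.df = cbind(df,s)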

You may need to change the NAs to 0 or something else, but that should be easy.
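As a rough end-to-end sketch, here is the same approach applied to the simplified table from the question, including the NA-to-0 step. The "Total" prefix on the new column names is only an illustrative choice that mirrors the desired output in the question.

```r
# Toy data copied from the question's simplified table
df <- data.frame(
  Total     = c(30, 20, 115, 100, 35),
  VariableA = c(20, 20, 15, 50, 10),
  LocationA = factor(c("East", "South", "East", "West", "East")),
  VariableB = c(10, NA, 100, 50, 25),
  LocationB = factor(c("East", "West", "South", "West", "South"))
)

# Long vectors: values, their locations, and the row each one came from
v.col  <- unlist(df[, c(2, 4)], use.names = FALSE)
v.loc  <- unlist(lapply(df[, c(3, 5)], as.character), use.names = FALSE)
v.case <- rep(seq_len(nrow(df)), 2)

# One row per case, one column per location; cells with no data come back as NA
s <- tapply(v.col, list(v.case, v.loc), sum, na.rm = TRUE)
s[is.na(s)] <- 0                              # treat "no entry for this location" as 0
colnames(s) <- paste0("Total", colnames(s))   # TotalEast, TotalSouth, TotalWest

new.df <- cbind(df, s)
print(new.df)
```

The long vectors are only temporary here: new.df still has one row per original observation, so nothing is duplicated and any survey weights stored per row would stay aligned.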

There may be an even simpler way to do this with the plyr/reshape packages, but I am not an expert in those.

Hope this helps.
