r - R: Grouping two rows into a new row based on factor identity

Question

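In a large data frame, I'm trying to create a new row that groups specific data from other rows based on the identity of another factor. Here is some example data: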

> Species    Status    Value
> A         Introduced   10
> A          Native      3
> B          Crypt       6
> C         Introduced   19
> C          Native      4

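For each species, I want to create a new row that takes only the data with status "Introduced" or "Crypt" and ignores the data with status "Native". Each species has either "Introduced" and "Native" data only, or "Crypt" data only.

So the output I want would look like this: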

> Species    Status    Value
> A         Introduced   10
> A          Native      3
> A         IC.Total     10
> B          Crypt       6
> B         IC.Total     6
> C         Introduced   19
> C          Native      4
> C         IC.Total     19

Is a for loop the best way to do this, or is there a more elegant approach? Any suggestions would be much appreciated!

score 2 · Accepted Answer

The following uses the data.table package.
Assuming your original data.frame is called myDat:

library(data.table)
myDT <- data.table(myDat, key="Species")

# Creates a new DT containing only the Species column (one row per species)
myDT2 <- setkey(unique(myDT[, list(Species)]), "Species")

# Add IC.Total values
myDT2[myDT[Status=="Introduced"], c("Status", "ValueC") := list("IC.Total", Value)]

# Add Crypt values
myDT2[myDT[Status=="Crypt"], c("Status", "ValueC") := list("Crypt", Value)]

# fix the column name
setnames(myDT2, "ValueC", "Value")

# combine and sort by species
myDT <- setkey(rbind(myDT, myDT2), "Species")

myDT
#    Species     Status Value
# 1:       A Introduced    10
# 2:       A     Native     3
# 3:       A   IC.Total    10
# 4:       B      Crypt     6
# 5:       B      Crypt     6
# 6:       C Introduced    19
# 7:       C     Native     4
# 8:       C   IC.Total    19

Note that if you don't want the Crypt count repeated, just leave out that line above (the "Add Crypt values" step).
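If you would rather have the extra row for B labelled IC.Total, as in the desired output from the question, here is a minimal sketch of a more direct route. It assumes data.table is installed and rebuilds the sample data as myDat; the names totals and result are only for illustration and are not part of the answer above.

```r
library(data.table)

# Sample data from the question (rebuilt here so the sketch is self-contained)
myDat <- data.frame(
  Species = c("A", "A", "B", "C", "C"),
  Status  = c("Introduced", "Native", "Crypt", "Introduced", "Native"),
  Value   = c(10, 3, 6, 19, 4),
  stringsAsFactors = FALSE
)
myDT <- as.data.table(myDat)

# Copy every non-Native row, relabelling its Status as "IC.Total"
totals <- myDT[Status != "Native", .(Species, Status = "IC.Total", Value)]

# Append the new rows and sort by Species; rows that tie on Species
# keep their relative order, so each IC.Total row lands after the
# original rows of its species
result <- setkey(rbind(myDT, totals), Species)
result
```

This relies on the assumption stated in the question that each species has at most one Introduced or Crypt row; with several such rows per species you would need to sum them by Species first.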

score 1 · Accepted Answer

You can use merge and aggregate (even though there isn't really anything to aggregate):

merge(mydf, 
      cbind(aggregate(Value ~ Species, mydf, sum, 
                      subset = c(Status != "Native")), 
            Status = "IC.Total"),
      all = TRUE)
#   Species     Status Value
# 1       A Introduced    10
# 2       A     Native     3
# 3       A   IC.Total    10
# 4       B      Crypt     6
# 5       B   IC.Total     6
# 6       C Introduced    19
# 7       C     Native     4
# 8       C   IC.Total    19

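I used aggregate because it has a convenient subset argument that lets you subset the data. In this case, we're not interested in "Native". Also, we know that a species will never have both "Introduced" and "Crypt", and that "Introduced" or "Crypt" will never have more than one value, so using sum as our aggregation function won't change anything.

Update

Even if you have multiple "Value" variables (as you pointed out in the comments), the concept behind this solution still holds, but it needs a few minor modifications, as shown below.

First, let's put together some data: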
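To see what the aggregate step contributes on its own, here is a small base R sketch of just that intermediate result. It assumes mydf holds the single-Value sample data from the question; the name totals is only for illustration.

```r
# Sample data from the question (single Value column)
mydf <- data.frame(
  Species = c("A", "A", "B", "C", "C"),
  Status  = c("Introduced", "Native", "Crypt", "Introduced", "Native"),
  Value   = c(10, 3, 6, 19, 4),
  stringsAsFactors = FALSE
)

# Drop the Native rows via the subset argument, then "sum" per species.
# Each species has exactly one remaining row, so sum returns it unchanged.
totals <- aggregate(Value ~ Species, mydf, sum, subset = Status != "Native")
totals
#   Species Value
# 1       A    10
# 2       B     6
# 3       C    19
```

Binding Status = "IC.Total" onto this with cbind and merging with all = TRUE is what turns these three rows into the extra rows of the final output.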

mydf <- data.frame(
  Species = c("A", "A", "B", "C", "C"),
  Status = c("Introduced", "Native", "Crypt", "Introduced", "Native"),
  Value1 = c(10, 3, 6, 19, 4),
  Value2 = c(6, 8, 12, 19, 5),
  Value3 = c(18, 19, 14, 13, 2))
mydf
#   Species     Status Value1 Value2 Value3
# 1       A Introduced     10      6     18
# 2       A     Native      3      8     19
# 3       B      Crypt      6     12     14
# 4       C Introduced     19     19     13
# 5       C     Native      4      5      2

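Second, use aggregate and merge as before, but note a few subtle differences. First, we can't use subset the way we did before, so instead of aggregating the whole dataset, we aggregate only the rows we're interested in. Second, we've added "Status" as a grouping variable, which, given the data structure you've described, should make no difference to your result. Third, after aggregating, we need to drop the "Status" column and add a new Status column (that's what the [-2] in the code is doing: dropping the second column).

Here it is, all in one tidy package: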

merge(mydf, 
      cbind(aggregate(. ~ Species + Status, 
                      mydf[mydf$Status != "Native", ], sum)[-2], 
            Status = "IC.Total"),
      all = TRUE)
#   Species     Status Value1 Value2 Value3
# 1       A Introduced     10      6     18
# 2       A     Native      3      8     19
# 3       A   IC.Total     10      6     18
# 4       B      Crypt      6     12     14
# 5       B   IC.Total      6     12     14
# 6       C Introduced     19     19     13
# 7       C     Native      4      5      2
# 8       C   IC.Total     19     19     13
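If the nested call is hard to follow, the same idea can be unpacked into separate steps. This is only a sketch under the same assumptions as above (one Introduced or Crypt row per species); the intermediate names keep, agg, totals and result are made up for illustration.

```r
# Sample data with several value columns
mydf <- data.frame(
  Species = c("A", "A", "B", "C", "C"),
  Status  = c("Introduced", "Native", "Crypt", "Introduced", "Native"),
  Value1  = c(10, 3, 6, 19, 4),
  Value2  = c(6, 8, 12, 19, 5),
  Value3  = c(18, 19, 14, 13, 2),
  stringsAsFactors = FALSE
)

# Step 1: keep only the rows we care about
keep <- mydf[mydf$Status != "Native", ]

# Step 2: aggregate all value columns, grouped by Species and Status
agg <- aggregate(. ~ Species + Status, keep, sum)

# Step 3: drop the old Status column (column 2) and attach the new label
totals <- cbind(agg[-2], Status = "IC.Total")

# Step 4: combine with the original rows; all = TRUE keeps every row
result <- merge(mydf, totals, all = TRUE)
result
```

Row order within each species may differ from the output shown above depending on whether Status is stored as a factor or as character, so sort explicitly if the order matters.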
