r - Aggregate a data frame on a given column and display another column

Question

I have a data frame in R of the following form:

> head(data)
  Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f

I would like to aggregate it on the Score column using the max function:

> aggregate(data$Score, list(data$Group), max)

  Group.1         x
1       1         3
2       2         4

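But I would also like to display the Info value associated with the maximum Score in each group. I have no idea how to do this. My desired output is: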

  Group.1         x        y
1       1         3        c
2       2         4        d

Any hints?

score 53 · Accepted Answer

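A base R solution is to combine the output of aggregate() with a merge() step. I find the formula interface to aggregate() a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that.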

The aggregate() step is

maxs <- aggregate(Score ~ Group, data = dat, FUN = max)

and the merge() step is simply

merge(maxs, dat)

This gives us the desired output:

R> maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
R> merge(maxs, dat)
  Group Score Info
1     1     3    c
2     2     4    d

You could, of course, collapse this into a one-liner (the intermediate step was mainly for exposition):

merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)

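The main reason I used the formula interface is that it returns a data frame with the correct names for the merge step; these are the names of the columns in the original data set dat. The output of aggregate() needs to have the correct names so that merge() knows which columns in the original and aggregated data frames match.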

The standard interface gives odd names, whichever way you call it:

R> aggregate(dat$Score, list(dat$Group), max)
  Group.1 x
1       1 3
2       2 4
R> with(dat, aggregate(Score, list(Group), max))
  Group.1 x
1       1 3
2       2 4

We can use merge() on those outputs, but we need to do more work to tell R which columns match up.
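As a minimal sketch of that extra work, the aggregated and original columns can be paired explicitly through merge()'s by.x and by.y arguments. The data frame dat is rebuilt here from the question's example so the snippet runs on its own; the names maxs and res are only illustrative.

```r
# Rebuild the example data from the question
dat <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                  Score = c(1, 2, 3, 4, 3, 1),
                  Info  = c("a", "b", "c", "d", "e", "f"))

# Standard interface: the output columns are named Group.1 and x
maxs <- aggregate(dat$Score, list(dat$Group), max)

# Tell merge() which columns correspond to each other
res <- merge(maxs, dat,
             by.x = c("Group.1", "x"),
             by.y = c("Group", "Score"))
res
```

The key columns keep the names from the first data frame (Group.1 and x), so they may still need renaming afterwards.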

score 38 · Accepted Answer

First, you split the data using split:

split(z,z$Group)

Then, for each chunk, select the row with the highest Score:

lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])

Finally, reduce the list back to a single data.frame by calling rbind through do.call:

do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))

Result:

  Group Score Info
1     1     3    c
2     2     4    d

One line, no magic, fast, and the result has good names =)
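For reference, here is the same pipeline as a self-contained sketch. The answer does not define z, so it is assumed here to be the question's data frame; the row-name reset at the end is an optional extra, not part of the original answer.

```r
# Assumption: z is the example data frame from the question
z <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                Score = c(1, 2, 3, 4, 3, 1),
                Info  = c("a", "b", "c", "d", "e", "f"))

# Split by group, then keep the first row with the highest Score in each chunk
chunks <- split(z, z$Group)
best   <- lapply(chunks, function(chunk) chunk[which.max(chunk$Score), ])

# Stack the one-row data frames back into a single data frame
res <- do.call(rbind, best)
rownames(res) <- NULL  # optional: drop the row names inherited from split()
res
```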

score 15 · Accepted Answer

Here is a solution using the plyr package.

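The following line of code tells ddply to first group your data by Group and then, within each group, return the subset of rows whose Score equals the maximum Score in that group.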

library(plyr)
ddply(data, .(Group), function(x)x[x$Score==max(x$Score), ])

  Group Score Info
1     1     3    c
2     2     4    d

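And, as @SachaEpskamp points out, this can be further simplified to:

ddply(data, .(Group), function(x)x[which.max(x$Score), ])

(Note that which.max returns only the first maximum, so this version keeps a single row per group, whereas the == version above returns every tied row, if there are any.)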
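To see how the two row selectors behave when a group has tied maximum scores, here is a small sketch. It uses base R (split, lapply, rbind) instead of plyr so that it runs without extra packages; the data frame tied and the helper per_group are hypothetical names introduced only for this illustration.

```r
# Hypothetical data: group 1 has two rows tied for the maximum Score
tied <- data.frame(Group = c(1, 1, 2),
                   Score = c(3, 3, 4),
                   Info  = c("a", "b", "c"))

# Helper that applies a row selector within each group and stacks the results
per_group <- function(df, pick) {
  out <- do.call(rbind, lapply(split(df, df$Group), pick))
  rownames(out) <- NULL
  out
}

first_only <- per_group(tied, function(x) x[which.max(x$Score), ])
all_ties   <- per_group(tied, function(x) x[x$Score == max(x$Score), ])

nrow(first_only)  # which.max keeps one row per group
nrow(all_ties)    # the == comparison keeps every tied row
```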

score 5 · Accepted Answer

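The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns, apply a function to each piece, and get a data frame back; with the summarize() function you can then use the columns of each split piece as variables to build the new data frame: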

dat <- read.table(textConnection('Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f'))

library("plyr")

ddply(dat,.(Group),summarize,
    Max = max(Score),
    Info = Info[which.max(Score)])
  Group Max Info
1     1   3    c
2     2   4    d

score 5 · Accepted Answer

A late answer, but here is an approach using data.table:

library(data.table)
DT <- data.table(dat)

DT[, .SD[which.max(Score),], by = Group]

Or, if there may be more than one equally high score:

DT[, .SD[which(Score == max(Score)),], by = Group]

Note that (from ?data.table):

.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)

score 5 · Accepted Answer

To add to Gavin's answer: before the merge, it is possible to make aggregate use proper names even when not using the formula interface:

aggregate(data[, "Score", drop = FALSE], list(Group = data$Group), max)
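A minimal sketch of how this feeds into the merge step, assuming the question's column names (Group, Score, Info) and max as the summary function; dat, maxs and res are illustrative names.

```r
# Rebuild the example data from the question
dat <- data.frame(Group = c(1, 1, 1, 2, 2, 2),
                  Score = c(1, 2, 3, 4, 3, 1),
                  Info  = c("a", "b", "c", "d", "e", "f"))

# drop = FALSE keeps Score as a one-column data frame, so its name survives;
# naming the list element gives the grouping column its name
maxs <- aggregate(dat[, "Score", drop = FALSE], list(Group = dat$Group), max)

# Both data frames now share Group and Score, so merge() needs no by arguments
res <- merge(maxs, dat)
res
```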

score 4 · Accepted Answer

This is how I would approach the problem in base R.

my.df <- data.frame(group = rep(c(1,2), each = 3), 
        score = runif(6), info = letters[1:6])
my.agg <- with(my.df, aggregate(score, list(group), max))
my.df.split <- with(my.df, split(x = my.df, f = group))
my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
            x[which(x$score == max(x$score)), "info"]
        }))

> my.agg
  Group.1         x info
1       1 0.9344336    a
2       2 0.7699763    e

score 4 · Accepted Answer

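I don't have a high enough reputation to comment on Gavin Simpson's answer, but I wanted to warn that the standard syntax and the formula syntax of aggregate appear to treat missing values differently by default.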

#Create some data with missing values 
a<-data.frame(day=rep(1,5),hour=c(1,2,3,3,4),val=c(1,NA,3,NA,5))
  day hour val
1   1    1   1
2   1    2  NA
3   1    3   3
4   1    3  NA
5   1    4   5

#Standard syntax
aggregate(a$val,by=list(day=a$day,hour=a$hour),mean,na.rm=T)
  day hour   x
1   1    1   1
2   1    2 NaN
3   1    3   3
4   1    4   5

#Formula syntax.  Note the index for hour 2 has been silently dropped.
aggregate(val ~ hour + day,data=a,mean,na.rm=T)
  hour day val
1    1   1   1
2    3   1   3
3    4   1   5
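If the standard-syntax behaviour is what you want from the formula interface, one option (a sketch, not part of the original answer) is to override its default na.action, which is na.omit, with na.pass. Rows with missing values then reach FUN, and na.rm = TRUE handles them there.

```r
# Same data as above, with missing values in val
a <- data.frame(day = rep(1, 5), hour = c(1, 2, 3, 3, 4),
                val = c(1, NA, 3, NA, 5))

# na.pass keeps the NA rows, so hour 2 is retained as a group
res <- aggregate(val ~ hour + day, data = a, FUN = mean,
                 na.rm = TRUE, na.action = na.pass)
res
```

Hour 2 then appears with NaN, as in the standard-syntax output, because the mean of an all-missing group with na.rm = TRUE is NaN.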
