I have a data frame that looks like this:

Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34

I would like to strip down the data frame by only keeping the record for each Reach with the highest HQ, resulting in this:

 Reach Chem HQ 
 a Nickel  1.65
 b Cadmium 3.12
 c Nickel 2.34

What is the best way to do this?

4 Answers

Here is one way (or close to it) in base R.

Get the data:

test <- read.table(textConnection("Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34"),header=TRUE)

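Use by to return, for each Reach group, the row with the highest HQ (identified with which.max), then join those rows into a single data set with do.call(rbind, ...).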

do.call(rbind,by(test,test$Reach,function(x) x[which.max(x$HQ),]))

Result:

  Reach    Chem   HQ
a     a  Nickel 1.65
b     b Cadmium 3.12
c     c  Nickel 2.34

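Edit: to address the discussion below between mindless.panda and joran about ties for the maximum value, this version keeps every row that is tied for the maximum within its Reach: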

do.call(rbind,by(test,test$Reach,function(x) x[x$HQ==max(x$HQ),]))
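
To make the difference between the two versions concrete, here is a small sketch. The tied data frame is hypothetical (it is not part of the question); it just forces a tie in Reach b so the two calls can be compared.

```r
# Hypothetical data: Reach "b" has two chemicals tied at the maximum HQ
tied <- data.frame(
  Reach = c("a", "a", "b", "b", "b"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium"),
  HQ    = c(1.12, 1.65, 1.54, 3.12, 3.12)
)

# which.max() keeps only the first maximum within each Reach
first_max <- do.call(rbind, by(tied, tied$Reach, function(x) x[which.max(x$HQ), ]))

# Comparing against max() keeps every row tied for the maximum
all_max <- do.call(rbind, by(tied, tied$Reach, function(x) x[x$HQ == max(x$HQ), ]))

nrow(first_max)  # one row per Reach
nrow(all_max)    # one extra row, because both tied rows in "b" are kept
```

Which one you want depends on whether a tie should produce one record or several for that Reach.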
answered 2012-07-27T06:21:36.537

If you prefer a plyr approach:

data <- read.table(text="Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34", header=TRUE)

require(plyr)
ddply(data, .(Reach), summarize, Chem=Chem[which.max(HQ)], MaxHQ=max(HQ))

  Reach    Chem  MaxHQ
1     a  Nickel   1.65
2     b Cadmium   3.12
3     c  Nickel   2.34

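Edit:

Motivated in part by this similar question, and thinking about the case where there is more than one Chem-type column (columns that are not being subset on), where repeating Chem=Chem[which.max(HQ)] for each column would get verbose, I came up with this. I'm curious whether the plyr wizards can weigh in if there is a better way to do it: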

# add the within-group max HQ as a column
df <- ddply(data, .(Reach), transform, MaxHQByReach=max(HQ))

# now select the rows where the HQ equals the Max HQ, dropping the above column
subset(df, df$HQ==df$MaxHQByReach)[,1:(ncol(df)-1)]
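
For comparison, the same "compute the within-group max, then keep the rows that equal it" idea can be sketched in base R with ave. This is only an alternative illustration, not a plyr answer; the names max_by_reach and top are made up for the example.

```r
data <- read.table(text = "Reach Chem HQ
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34", header = TRUE)

# ave() returns the group-wise max of HQ, repeated for every row of its group
max_by_reach <- ave(data$HQ, data$Reach, FUN = max)

# Keep the rows whose HQ equals their group's max; every column comes along,
# so nothing has to be repeated per column
top <- data[data$HQ == max_by_reach, ]
top
```

As with the transform version, rows tied for the maximum within a Reach would all be kept.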
answered 2012-07-27T01:34:46.430

Maybe you could try using ?order and ?duplicated like this:

my_df = data.frame(
    Reach = c("a","a","b","b","b","c","c"), 
    Chem = c("Mercury","Nickel","Mercury","Nickel","Cadmium","Mercury","Nickel"),
    HQ = c(1.12,1.65,1.54,2.34,3.12,2.12,2.34)
    )

my_df = my_df[order(my_df$HQ,decreasing=TRUE),]
my_df = my_df[!duplicated(my_df$Reach),]
my_df = my_df[order(my_df$Reach),]

Edit: for clarity, the result is shown below.

  Reach    Chem   HQ
2     a  Nickel 1.65
5     b Cadmium 3.12
7     c  Nickel 2.34
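
A possible variation, shown here only as a sketch: sorting on both keys in a single order call gives the same rows without the final re-sort. The names sorted and top are introduced just for this example.

```r
my_df <- data.frame(
  Reach = c("a", "a", "b", "b", "b", "c", "c"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium", "Mercury", "Nickel"),
  HQ    = c(1.12, 1.65, 1.54, 2.34, 3.12, 2.12, 2.34)
)

# Sort by Reach, then by HQ descending within each Reach
sorted <- my_df[order(my_df$Reach, -my_df$HQ), ]

# The first row of each Reach is now its highest-HQ row;
# duplicated() flags every later row of the same Reach
top <- sorted[!duplicated(sorted$Reach), ]
top
```

Note that this keeps exactly one row per Reach, so if two chemicals tie for the maximum only one of them survives.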
answered 2012-07-27T00:41:43.223

Hi, you could also use max and lapply like this:

Reach <- unique(my_df$Reach)

# Maximum HQ within each Reach
HQ <- unlist(lapply(Reach, function(r) max(my_df$HQ[my_df$Reach == r])))

# Chem on the max-HQ row of the same Reach (matching on HQ alone
# could pick up a row from a different Reach with an equal value)
Chem <- unlist(lapply(Reach, function(r) {
  rows <- my_df[my_df$Reach == r, ]
  as.character(rows$Chem[which.max(rows$HQ)])
}))

new.df <- data.frame(Reach, Chem, HQ)
new.df

  Reach    Chem   HQ
1     a  Nickel 1.65
2     b Cadmium 3.12
3     c  Nickel 2.34
answered 2012-07-27T05:22:38.943