r - Removing records from a data frame based on comparison to other records

Question

I have a data frame that looks like this:

Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34

I would like to strip down the data frame by only keeping the record for each Reach with the highest HQ, resulting in this:

 Reach Chem HQ 
 a Nickel  1.65
 b Cadmium 3.12
 c Nickel 2.34

What is the best way to do this?

score 4 · Accepted Answer

Here is one way (or close to it) in base R.

Get the data:

test <- read.table(textConnection("Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34"),header=TRUE)

Return the row with the highest HQ within each Reach group using by and which.max, then use do.call(rbind, ...) to join the identified rows into a single data set.

do.call(rbind,by(test,test$Reach,function(x) x[which.max(x$HQ),]))

Result:

  Reach    Chem   HQ
a     a  Nickel 1.65
b     b Cadmium 3.12
c     c  Nickel 2.34
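
To make the steps inside the one-liner easier to follow, here is a rough sketch that does the same job with split and lapply. The names groups, top_rows and result are illustrative only and do not come from the answer.

```r
# Same data as in the question, built inline so the example is self-contained.
test <- data.frame(
  Reach = c("a", "a", "b", "b", "b", "c", "c"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium", "Mercury", "Nickel"),
  HQ    = c(1.12, 1.65, 1.54, 2.34, 3.12, 2.12, 2.34)
)

# Step 1: split the data frame into one piece per Reach.
groups <- split(test, test$Reach)

# Step 2: within each piece, keep the row at the position of the largest HQ.
top_rows <- lapply(groups, function(x) x[which.max(x$HQ), ])

# Step 3: stack the one-row pieces back into a single data frame.
result <- do.call(rbind, top_rows)
result
```

by() wraps steps 1 and 2 in a single call, which is why the original version fits on one line.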

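Edit: to address the discussion below from mindless.panda and joran about possible ties for the maximum value, this will work (it keeps every tied row):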

do.call(rbind,by(test,test$Reach,function(x) x[x$HQ==max(x$HQ),]))
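
The data in the question has no ties within a Reach, so both versions return the same rows there. The sketch below uses a made-up data frame (tied, not from the question) in which Reach "a" has two chemicals at the same maximum HQ, to show how the two versions differ.

```r
# Hypothetical data: Reach "a" has two chemicals tied at the maximum HQ.
tied <- data.frame(
  Reach = c("a", "a", "a", "b"),
  Chem  = c("Mercury", "Nickel", "Cadmium", "Mercury"),
  HQ    = c(1.65, 1.65, 1.10, 2.00)
)

# which.max() returns only the first position of the maximum: one row per Reach.
first_only <- do.call(rbind, by(tied, tied$Reach, function(x) x[which.max(x$HQ), ]))

# Comparing against max() keeps every row that reaches the maximum.
all_ties <- do.call(rbind, by(tied, tied$Reach, function(x) x[x$HQ == max(x$HQ), ]))

nrow(first_only)  # 2
nrow(all_ties)    # 3
```

Which behaviour is wanted depends on whether a tie should produce one record or several for that Reach.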

score 3 · Accepted Answer

If you prefer a plyr approach:

data <- read.table(text="Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34", header=TRUE)

require(plyr)
ddply(data, .(Reach), summarize, Chem=Chem[which.max(HQ)], MaxHQ=max(HQ))

  Reach    Chem  MaxHQ
1     a  Nickel   1.65
2     b Cadmium   3.12
3     c  Nickel   2.34

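Edit:

Partly motivated by this similar question, and thinking about the case where there is more than one Chem-type column (columns that are not being grouped on), so that repeating Chem=Chem[which.max(HQ)] for each column would get verbose, I came up with this. I'm curious whether the plyr wizards can weigh in if there is a better way to do it: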

# add the within-group max HQ as a column
df <- ddply(data, .(Reach), transform, MaxHQByReach=max(HQ))

# now select the rows where the HQ equals the Max HQ, dropping the above column
subset(df, df$HQ==df$MaxHQByReach)[,1:(ncol(df)-1)]
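
For comparison, the same "attach the group maximum, then filter" idea can be sketched in base R with ave(), which avoids adding and then dropping a helper column. This is an alternative I am suggesting, not part of the plyr answer; max_by_reach and result are illustrative names.

```r
# Same data as in the question, built inline so the example is self-contained.
data <- data.frame(
  Reach = c("a", "a", "b", "b", "b", "c", "c"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium", "Mercury", "Nickel"),
  HQ    = c(1.12, 1.65, 1.54, 2.34, 3.12, 2.12, 2.34)
)

# ave() returns a vector as long as HQ holding each row's within-Reach maximum.
max_by_reach <- ave(data$HQ, data$Reach, FUN = max)

# Keep the rows whose HQ equals their group's maximum; every original column
# survives, however many there are.
result <- data[data$HQ == max_by_reach, ]
result
```

Like the ddply/transform version, this keeps all rows that tie for the maximum within a Reach.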

score 3 · Accepted Answer

Maybe you could try using ?order and ?duplicated like this:

my_df = data.frame(
    Reach = c("a","a","b","b","b","c","c"), 
    Chem = c("Mercury","Nickel","Mercury","Nickel","Cadmium","Mercury","Nickel"),
    HQ = c(1.12,1.65,1.54,2.34,3.12,2.12,2.34)
    )

my_df = my_df[order(my_df$HQ,decreasing=TRUE),]
my_df = my_df[!duplicated(my_df$Reach),]
my_df = my_df[order(my_df$Reach),]

Edit: for clarity, the result is shown below.

  Reach    Chem   HQ
2     a  Nickel 1.65
5     b Cadmium 3.12
7     c  Nickel 2.34
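
The reason this works may be easier to see with the intermediate flag exposed. The sketch below repeats the first two steps and binds the result of !duplicated() next to the sorted rows; sorted and keep are illustrative names, not from the answer.

```r
# Same data as in the answer, built inline so the example is self-contained.
my_df <- data.frame(
  Reach = c("a", "a", "b", "b", "b", "c", "c"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium", "Mercury", "Nickel"),
  HQ    = c(1.12, 1.65, 1.54, 2.34, 3.12, 2.12, 2.34)
)

# Sort so that the largest HQ values come first.
sorted <- my_df[order(my_df$HQ, decreasing = TRUE), ]

# duplicated() is FALSE the first time a Reach appears and TRUE afterwards,
# so after the descending sort the kept rows are each Reach's top row.
keep <- !duplicated(sorted$Reach)

# Show the flag alongside the sorted rows.
cbind(sorted, keep)
```

Because only the first occurrence of each Reach is kept, this approach returns exactly one row per Reach even when HQ values tie.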

score 2 · Accepted Answer

Hi, you can also use max and lapply like this:

# my_df is the original seven-row data frame
Reach <- unique(my_df$Reach)
# maximum HQ within each Reach
HQ <- unlist(lapply(Reach, function(r) max(my_df$HQ[my_df$Reach == r])))
# Chem of the max-HQ row, looked up within the same Reach (2.34 occurs in both b and c)
Chem <- unlist(lapply(Reach, function(r) {
    rows <- my_df[my_df$Reach == r, ]
    as.character(rows$Chem[which.max(rows$HQ)])
}))

new.df <- data.frame(Reach, Chem, HQ)
new.df

  Reach    Chem   HQ
1     a  Nickel 1.65
2     b Cadmium 3.12
3     c  Nickel 2.34
