r - Get the max value per id, then keep only one row per id

Question

I want to make my df smaller by keeping only one observation per person per date, chosen by each person's maximum quantity.

Here is my df:

names      dates quantity
1    tom 2010-02-01       28
3    tom 2010-03-01        7
2   mary 2010-05-01       30
6    tom 2010-06-01       21
4   john 2010-07-01       45
5   mary 2010-07-01       30
8   mary 2010-07-01       28
11   tom 2010-08-01       28
7   john 2010-09-01       28
10  john 2010-09-01       30
9   john 2010-07-01       45
12  mary 2010-11-01       28
13  john 2010-12-01        7
14  john 2010-12-01       14

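I started by finding the maximum quantity per person per date. That works, but as you can see, when a person has tied quantities on a date, all of the tied observations for that date are kept.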

merge(df, aggregate(quantity ~ names+dates, df, max))



 names      dates quantity
1   john 2010-07-01       45
2   john 2010-07-01       45
3   john 2010-09-01       30
4   john 2010-12-01       14
5   mary 2010-05-01       30
6   mary 2010-07-01       30
7   mary 2010-11-01       28
8    tom 2010-02-01       28
9    tom 2010-03-01        7
10   tom 2010-06-01       21
11   tom 2010-08-01       28

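So my next step is to take only the first obs per date (given that I have already selected the largest quantity). I can't get the code right. Here is what I have tried: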

merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m  ##doesn't get rid of one obs for john

and a data.table option:

l[, .SD[1], by=c(names,dates)]  ##doesn't work at all

I like the aggregate and data.table options because they are fast, and the df has about 100k rows.

Thanks in advance!

Solution

I posted too quickly, sorry!! A simple way to solve this is to find the duplicates and then remove them. For example:

merge(df, aggregate(quantity ~ names+dates, df, max))->toy
toy$dup<-duplicated(toy)   # TRUE for rows that repeat an earlier row
toy<-toy[!toy$dup,]        # keep only the first occurrence of each row
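
Note that `duplicated(toy)` only flags rows that are identical in every column. As a minimal sketch, assuming the data frame might also carry other columns that differ between tied rows (the `note` column below is hypothetical and not part of the original data), sorting by quantity and deduplicating on the key columns alone should leave exactly one row per person per date:

```r
# Toy data: the tie for john on 2010-07-01 differs in a hypothetical extra column
df <- data.frame(
  names    = c("john", "john", "mary", "mary"),
  dates    = c("2010-07-01", "2010-07-01", "2010-07-01", "2010-07-01"),
  quantity = c(45, 45, 30, 28),
  note     = c("a", "b", "c", "d"),   # hypothetical column, for illustration only
  stringsAsFactors = FALSE
)

# Sort so the largest quantity comes first within each names/dates group
sorted <- df[order(df$names, df$dates, -df$quantity), ]

# Keep only the first row of each names/dates group
one_per_group <- sorted[!duplicated(sorted[c("names", "dates")]), ]
print(one_per_group)
```

Deduplicating on `names` and `dates` only, rather than on the whole row, is what makes this robust to extra columns.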

Here are the system timings:

 system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
   user  system elapsed 
  20.04    0.04   20.07 



 system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
   user  system elapsed 
 19.129   0.028  19.148

score 2 · Accepted Answer

I'm not sure this will give you the output you want, but it definitely takes care of the "duplicate rows":

 # Replicating your dataframe
 df <- data.frame(names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom", "john", "john", "john", "mary", "john", "john"), dates = c("2010-02-01","2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01", "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-12-01", "2010-12-01"), quantity = c(28,7,30,21,45,30,28,28,28,30,45,28,7,14)) 

 temp = merge(df, aggregate(quantity ~ names+dates, df, max))
 df.unique = unique(temp)
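
Since the example data has only the three columns `names`, `dates` and `quantity`, the `merge` step may not be needed at all: `aggregate` on its own already returns one row per `names`/`dates` pair. A minimal sketch, reusing the data from the question:

```r
# Same data as in the question
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary",
            "tom", "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14)
)

# One row per names/dates pair, holding the maximum quantity
res <- aggregate(quantity ~ names + dates, data = df, FUN = max)

# The merge + unique route ends up with the same number of rows
via_merge <- unique(merge(df, res))
nrow(res)        # 10
nrow(via_merge)  # 10
```

The `merge` only becomes useful when the data frame has additional columns that need to be carried along with the maximum.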

score 2 · Accepted Answer

Here is a data.table solution:

dt[, max(quantity), by = list(names, dates)]

Benchmark:

library(data.table)
library(microbenchmark)

N = 1e6

dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)

op = function(df) aggregate(quantity ~ names+dates, df, max) 
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]

microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
#     expr      min        lq   median        uq      max neval
#   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10
# eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10
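
Note that `dt[, max(quantity), by = list(names, dates)]` names the result column `V1` and returns only the grouping columns plus the maximum. As a hedged sketch, assuming you want a named result column or the full row that holds the maximum (the `note` column below is hypothetical), something along these lines should work:

```r
library(data.table)

dt <- data.table(
  names    = c("john", "john", "john", "mary"),
  dates    = c("2010-07-01", "2010-07-01", "2010-09-01", "2010-07-01"),
  quantity = c(45, 45, 30, 28),
  note     = c("a", "b", "c", "d")   # hypothetical extra column
)

# Same aggregation, but with a named result column instead of V1
max_only <- dt[, .(quantity = max(quantity)), by = .(names, dates)]

# Keep the whole row holding the maximum; which.max() returns the first
# maximum, so tied rows collapse to a single row per group
full_rows <- dt[, .SD[which.max(quantity)], by = .(names, dates)]
print(full_rows)
```

The `.SD[which.max(...)]` form is convenient but may be slower than the plain `max()` aggregation on large tables, so it is worth benchmarking on your own data.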

score 1 · Accepted Answer

If you are using a data.frame,

 library(plyr)
 ddply(mydata, .(names, dates), summarize, maxquantity = max(quantity))

score 1 · Accepted Answer

do.call( rbind, 
        lapply( split(df, df[,c("names","dates") ]), function(d){
                                         d[which.max(d$quantity), ] } )
        )
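
This approach splits the data frame into one piece per `names`/`dates` combination, picks the row with the largest quantity from each piece, and stacks the results. Because `which.max` returns the index of the first maximum, tied rows collapse to one. A lightly commented sketch of the same idea on a small subset of the data, with `drop = TRUE` added (an assumption beyond the original code, to skip combinations that never occur in the data):

```r
df <- data.frame(
  names    = c("john", "john", "john", "mary"),
  dates    = c("2010-07-01", "2010-07-01", "2010-09-01", "2010-07-01"),
  quantity = c(45, 45, 30, 28)
)

# One data frame per observed names/dates combination
pieces <- split(df, df[, c("names", "dates")], drop = TRUE)

# From each piece keep the row with the (first) maximum quantity
best <- lapply(pieces, function(d) d[which.max(d$quantity), ])

# Stack the single-row pieces back into one data frame
result <- do.call(rbind, best)
print(result)
```

Unlike the `aggregate` approach, this keeps every column of the selected row, at the cost of being slower on large inputs.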
