I want to make my df smaller by keeping only one observation per person per date, chosen by each person's maximum quantity.

Here is my df:

   names      dates quantity
1    tom 2010-02-01       28
3    tom 2010-03-01        7
2   mary 2010-05-01       30
6    tom 2010-06-01       21
4   john 2010-07-01       45
5   mary 2010-07-01       30
8   mary 2010-07-01       28
11   tom 2010-08-01       28
7   john 2010-09-01       28
10  john 2010-09-01       30
9   john 2010-07-01       45
12  mary 2010-11-01       28
13  john 2010-12-01        7
14  john 2010-12-01       14

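I first did this by finding the maximum quantity per person per date. That works, but as you can see, when a person has tied quantities on a date, every tied row is kept (john on 2010-07-01 still has two rows).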

merge(df, aggregate(quantity ~ names+dates, df, max))



   names      dates quantity
1   john 2010-07-01       45
2   john 2010-07-01       45
3   john 2010-09-01       30
4   john 2010-12-01       14
5   mary 2010-05-01       30
6   mary 2010-07-01       30
7   mary 2010-11-01       28
8    tom 2010-02-01       28
9    tom 2010-03-01        7
10   tom 2010-06-01       21
11   tom 2010-08-01       28

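So my next step is to take only the first obs per person per date (given that I have already selected the largest quantity). I can't get the code right. Here is what I tried: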

merge(l, aggregate(names ~ dates, l, FUN=function(z) z[1]))->m  ##doesn't get rid of one obs for john

and a data.table option

l[, .SD[1], by=c(names,dates)]  ##doesn't work at all

I like the aggregate and data.table options because they are fast, and the df has about 100k rows.

Thanks in advance!

Solution

I posted too soon, sorry!! A simple way to solve this is to find the duplicates and then remove them. For example:

merge(df, aggregate(quantity ~ names+dates, df, max))->toy
toy$dup<-duplicated(toy)
toy<-toy[toy$dup!=TRUE,]
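
A variation on the duplicate-removal idea, as a minimal sketch in base R. `duplicated(toy)` compares whole rows, so it only drops ties that are identical in every column. Sorting by descending quantity and then checking duplicates on just `names` and `dates` keeps one row per group even when other columns differ. The names `sorted` and `one_per_group` are made up for illustration; the data is the sample from the question.

```r
# Sample data copied from the question.
df <- data.frame(
  names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom",
            "john", "john", "john", "mary", "john", "john"),
  dates = c("2010-02-01", "2010-03-01", "2010-05-01", "2010-06-01",
            "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01",
            "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01",
            "2010-12-01", "2010-12-01"),
  quantity = c(28, 7, 30, 21, 45, 30, 28, 28, 28, 30, 45, 28, 7, 14),
  stringsAsFactors = FALSE
)

# Sort so the largest quantity comes first within each names/dates group.
sorted <- df[order(df$names, df$dates, -df$quantity), ]

# Keep only the first row of each names/dates group.
one_per_group <- sorted[!duplicated(sorted[c("names", "dates")]), ]
```

On the sample data this leaves 10 rows, one for each names/dates pair.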

Here are the system times:

 system.time(dt2[, max(new_quan), by = list(hai_dispense_number, date_of_claim)]->method1)
   user  system elapsed 
  20.04    0.04   20.07 



 system.time(aggregate(new_quan ~ hai_dispense_number+date_of_claim, dt2, max)->rpp)
   user  system elapsed 
 19.129   0.028  19.148 

4 Answers

I'm not sure this will give you the output you want, but it will certainly take care of the "duplicate rows":

 # Replicating your dataframe
 df <- data.frame(names = c("tom", "tom", "mary", "tom", "john", "mary", "mary", "tom", "john", "john", "john", "mary", "john", "john"), dates = c("2010-02-01","2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-07-01", "2010-07-01", "2010-08-01", "2010-09-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-12-01", "2010-12-01"), quantity = c(28,7,30,21,45,30,28,28,28,30,45,28,7,14)) 

 temp = merge(df, aggregate(quantity ~ names+dates, df, max))
 df.unique = unique(temp)
answered 2013-08-19T18:17:31.133

Here is a data.table solution:

dt[, max(quantity), by = list(names, dates)]
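
A small sketch of two variations on that call, on made-up rows that mirror the ties in the question. The first names the result column, since the call above returns it as `V1`. The second keeps the whole first row holding the maximum in each group, which matters if the table has other columns. The names `max_only`, `idx` and `whole_rows` are illustrative, not from the original answer.

```r
library(data.table)

# A few rows with the same kind of ties as in the question.
dt <- data.table(
  names = c("john", "john", "john", "mary", "mary"),
  dates = c("2010-07-01", "2010-07-01", "2010-09-01",
            "2010-07-01", "2010-07-01"),
  quantity = c(45, 45, 30, 30, 28)
)

# Same aggregation, with a named result column instead of the default V1.
max_only <- dt[, .(quantity = max(quantity)), by = .(names, dates)]

# Alternative: row numbers of the first maximum in each group,
# then a subset by those row numbers to keep entire rows.
idx <- dt[, .I[which.max(quantity)], by = .(names, dates)]$V1
whole_rows <- dt[idx]
```

Both results have one row per names/dates pair. Note that `by` takes either a list of columns, as here, or a character vector such as `c("names", "dates")`.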

Benchmark:

library(data.table)
library(microbenchmark)

N = 1e6

dt = data.table(names = sample(letters, N, T), dates = sample(LETTERS, N, T), quantity = rnorm(N))
df = data.frame(dt)

op = function(df) aggregate(quantity ~ names+dates, df, max) 
eddi = function(dt) dt[, max(quantity), by = list(names, dates)]

microbenchmark(op(df), eddi(dt), times = 10)
#Unit: milliseconds
#     expr      min        lq   median        uq      max neval
#   op(df) 2535.241 3025.1485 3195.078 3398.4404 3533.209    10
# eddi(dt)  148.088  162.8073  198.222  220.1217  286.058    10
answered 2013-08-19T18:34:07.333

If you are working with a data.frame,

library(plyr)
ddply(mydata, .(names, dates), summarize, maxquantity = max(quantity))
answered 2013-08-19T18:16:23.160

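# Split by names/dates, keep the row with the largest quantity in each piece,
# then bind the pieces back together.
do.call(rbind,
        lapply(split(df, df[, c("names", "dates")]), function(d) {
          d[which.max(d$quantity), ]
        }))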
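
A short sketch of why this approach leaves a single row per group even with ties: `which.max()` returns only the position of the first maximum. The vector `tied` is made up for illustration.

```r
tied <- c(28, 45, 45)

# which.max() gives the position of the first maximum only.
first_max <- which.max(tied)           # 2

# which(x == max(x)) would give every tied position instead.
all_max <- which(tied == max(tied))    # 2 3
```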
answered 2013-08-19T18:20:37.707