4

I have an array of data containing some information about people and projects, for example:

person_id | project_id | action | time
--------------------------------------
        1 |          1 |      w |    1
        1 |          2 |      w |    2
        1 |          3 |      w |    2
        1 |          3 |      r |    3
        1 |          3 |      w |    4
        1 |          4 |      w |    4
        2 |          2 |      r |    2
        2 |          2 |      w |    3

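I would like to augment this data with two more fields, "first_time" and "first_time_project", which together identify the first time any action by that person was seen and the first time that person was seen performing any action on that project. In the end, the data should look like this: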

person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
        1 |          1 |      w |    1 |          1 |                  1
        1 |          2 |      w |    2 |          1 |                  2
        1 |          3 |      w |    2 |          1 |                  2
        1 |          3 |      r |    3 |          1 |                  2
        1 |          3 |      w |    4 |          1 |                  2
        1 |          4 |      w |    4 |          1 |                  4
        2 |          2 |      r |    2 |          2 |                  2
        2 |          2 |      w |    3 |          2 |                  2

My naive way of doing this was to write a couple of loops:

for (pid in unique(data$person_id)) {
    data[data$person_id==pid, "first_time"] = min(data[data$person_id==pid, "time"])
    for (projid in unique(data[data$person_id==pid, "project_id"])) {
        data[data$person_id==pid & data$project_id==projid, "first_time_project"] = min(data[data$person_id==pid & data$project_id==projid, "time"])
    }
}

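Now, it doesn't take a genius to see that the doubly nested loop will get quite slow. However, I can't think of another way to handle this in R. What I'm doing is something like SQL's GROUP BY. I know that by() might help, but I don't see how to slice on more than one variable.

Any hints on how to take my code from glacial speed to something faster? At this point I'd be happy with a snail's pace.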

5 Answers

4

The combination of Hadley's plyr and transform() is powerful. If I understand your problem correctly, then:

library(plyr)
foo <- ddply(foo, .(person_id), transform, first_time=min(time))
foo <- ddply(foo, .(person_id, project_id), transform, 
  first_time_project=min(time))
answered 2011-02-15T02:42:13.060
4

Try ave:

transform(data, 
   first_time = ave(time, person_id, FUN = min),
   first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
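
For reference, a minimal sketch that runs the same idea on the sample data from the question. The data.frame construction is an assumption about how the table is stored, and the person/project grouping is built explicitly with interaction(..., drop = TRUE) instead of passing drop through ave, so that only the pairs that actually occur form groups:

```r
# Sample data copied from the question (assumed to live in a data.frame).
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# Earliest time per person, repeated on every row of that person.
data$first_time <- ave(data$time, data$person_id, FUN = min)

# One grouping factor per observed (person, project) pair.
pair <- interaction(data$person_id, data$project_id, drop = TRUE)

# Earliest time per (person, project) pair.
data$first_time_project <- ave(data$time, pair, FUN = min)

print(data)
```

ave returns a vector of the same length as its input, so the result can be assigned straight back as a new column without any merge step.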
answered 2011-02-15T04:59:21.950
3

If speed is what you're looking for, then data.table is the way to go.

library(data.table)
DT <- data.table(foo)
DT[, first_time := min(time), by = person_id]
DT[, first_time_project := min(time), by = list(person_id, project_id)]
answered 2012-09-11T01:13:49.377
1

A quick and dirty solution, without loops:

library(plyr)


# function to get first time by any person/project
fp <- function(dat) 
{
dat$first_time=min(dat$time)
ftp <- function(d) { d$first_time_project=min(d$time); return (d) }
dat=ddply(dat, .(project_id), ftp)
return (dat)
}


#this single call should give you the result you want
result=ddply(data, .(person_id), fp) 
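
The nested ddply calls follow a split-apply-combine pattern. As a rough illustration of what that involves, here is a base R sketch using split, lapply, and rbind on the question's sample data. The helper name add_first_times is hypothetical, and this is not how plyr is implemented internally:

```r
# Sample data copied from the question (assumed to live in a data.frame).
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# Hypothetical helper: receives all rows of one person.
add_first_times <- function(dat) {
  # Earliest time for this person.
  dat$first_time <- min(dat$time)
  # Split this person's rows by project and add the per-project minimum.
  pieces <- lapply(split(dat, dat$project_id), function(d) {
    d$first_time_project <- min(d$time)
    d
  })
  # Combine the per-project pieces back into one data frame.
  do.call(rbind, pieces)
}

# Outer split by person, apply the helper, combine the results.
result <- do.call(rbind, lapply(split(data, data$person_id), add_first_times))
rownames(result) <- NULL

print(result)
```

The rows come back grouped by person and then by project, which is not necessarily the original row order.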
answered 2011-02-15T01:53:02.263
0

A quick way I can think of:

foo <- data.frame(
       person_id=rep(1:5,each=6),
       project_id=sample(1:5,30,replace=TRUE),
       time=sample(1:30))

first_time <- aggregate(foo$time, list(foo$person_id), min)

foo$first_time <- first_time[ match(foo$person_id,first_time[,1]),2]

bar <- subset(foo, time==first_time)

foo$first_time_project <- bar$project_id[match(foo$person_id, bar$person_id)]
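
Note that the last line stores the project_id of each person's earliest row, not the earliest time per person and project as in the question's expected table. If the latter is wanted, one possible sketch with aggregate and merge follows. It uses the question's sample data rather than the random foo, and the variable names agg and merged are only illustrative:

```r
# Sample data copied from the question (assumed to live in a data.frame).
data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3)
)

# Minimum time for every (person, project) pair, one row per pair.
agg <- aggregate(time ~ person_id + project_id, data = data, FUN = min)
names(agg)[names(agg) == "time"] <- "first_time_project"

# Join the per-pair minimum back onto every original row.
# merge() sorts by the key columns, so the row order may change.
merged <- merge(data, agg, by = c("person_id", "project_id"))

print(merged)
```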
answered 2011-02-15T01:02:48.627