6

I'm struggling to get the following done:

Example dataset:

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 

The problem is: I would like to extract the most recent entry (largest value for time) per belongID, unless this rating is 0. If the rating of the most recent entry is 0 however. I want the first entry with a rating (not the highest rating, just the first value with a rating that is not zero). If all other entries are also zero, the most recent one needs to be selected.

The end result should than be:

   belongID   uniqID   Time   Rating  
   1           101       5      0  
   2           105       2      5
   3           108       5      1  

The dataset is pretty large and is ordered by belongID. It is not ordered by time, so more recent entries may come after older entries with the same belongID.

Without having the "0 Rating" constraint, I used the following function to calculate the most recent entry:

>uniqueMax <- function(m, belongID = 1, time = 3) {
    t(
      vapply(
         split(1:nrow(m), m[,belongID]), 
         function(i, x, time) x[i, , drop=FALSE][which.max(x[i,time]),], m[1,], x=m, time=time
      )
    )
 }

I do not know how to incorporate the "0 Rating" constraint.

EDIT: A follow up question:

Does anyone know how the getRating function should be altered if not only rating zero, but more ratings need to be taken into account (for instance 0,1,4 and 5)? Thus assign to most recent, unless Rating 0 or 1 or 4 or 5? If Rating is 0,1,4,5 assign to most recent entry with a different rating. If all ratings are 0,1,4 or 5 assign to the most recent of those. I tried the following, but that did not work:

getRating <- function(x){
  iszero <- x$Rating == 0 | x$Rating == 1 | x$Rating == 4 | x$Rating ==5
  if(all(iszero)){
    id <- which.max(x$Time)
  } else {
    id <- which.max((!iszero)*x$Time) 
            # This trick guarantees taking 0 into account
  }
  x[id,]
}
# Do this over the complete data frame
do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 
     # edited per Tyler's suggestion'
4

4 回答 4

4

Here's my crack at it (interesting problem):

Reading in your data:

m <- read.table(text="belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 ", header=T)

Extracting the rows you asked for:

m2 <- m[order(m$belongID, -m$Time), ]                 #Order to get max time first
LIST <- split(m2, m$belongID)                         #split by belongID
FUN <- function(x) which(cumsum(x[, 'Rating'])!=0)[1] #find first non zero Rating
LIST2 <- lapply(LIST, function(x){                    #apply FUN; if NA do 1st row
        if (is.na(FUN(x))) {
            x[1, ]
        } else {
            x[FUN(x), ]
        }
    }
)
do.call('rbind', LIST2)                              #put it all back together

Which yields:

  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

EDIT With so many people answering this problem (fun to solve IMHO) it begged for a microbenchmark test (Windows 7):

Unit: milliseconds
    expr       min        lq    median        uq      max
1   JIGR  6.356293  6.656752  7.024161  8.697213 179.0884
2 JORRIS  2.932741  3.031416  3.153420  3.552554 246.9604
3  PETER 10.851046 11.459896 12.358939 17.164881 216.7284
4  TYLER  2.864625  2.961667  3.066174  3.413289 221.1569

And a graph:

enter image description here

于 2012-05-10T13:16:30.417 回答
3

Here's a solution that uses data.table for ease of filtering and performing my function getRecentRow separately for each belongID.

library(data.table)

# Load the data from the example.
dat = structure(list(belongID = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), 
          uniqID = 101:108, Time = c(5L, 4L, 4L, 3L, 2L, 4L, 5L, 5L),
          Rating = c(0L, 0L, 0L, 0L, 5L, 2L, 0L, 1L)), 
          .Names = c("belongID", "uniqID", "Time", "Rating"),
          row.names = c(NA, -8L), class = c("data.table", "data.frame"))

dat = data.table(dat) # Convert to data table.

# Function to get the row for a given belongID
getRecentRow <- function(data) {
    # Filter by Rating, then order by time, then select first.
    row = data[Rating != 0][order(-Time)][1]

    if(!is.na(row$uniqID)) {
        # A row was found with Rating != 0, return it.
        return(row)
     } else {
          # The row was blank, so filter again without restricting. rating.
          return(data[order(-Time)][1])
        }  
}

# Run getRecentRow on each chunk of dat with a given belongID
result = dat[,getRecentRow(.SD), by=belongID]

     belongID uniqID Time Rating
[1,]        1    101    5      0
[2,]        2    105    2      5
[3,]        3    108    5      1
于 2012-05-10T13:09:59.727 回答
3

One suggestion would be:

library(plyr)

maxV <- function(b) {
    if (b[which.max(b$Time), "Rating"]  != 0) {
        return(b[which.max(b$Time), ])
    } else if (!all(b$Rating==0)) {
        bb <- b[order(b$Rating), ]
        return(bb[bb$Rating != 0,][1, ])
    } else {
        return(b[which.max(b$Time),])
    }
}

a <- read.table(textConnection(" belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 "), header=T)

ddply(a, .(belongID), maxV)
  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1
于 2012-05-10T13:13:51.970 回答
3

EDIT :

As speed is your main concern, I edited my trick into your initial solution, which results in something like this:

uniqueMax <- function(m, belongID = 1, time = 3) {
  t(
    vapply(
      split(1:nrow(m), m[,belongID]), 
      function(i, x, time){ 
        is.zero <- x[i,'Rating'] == 0
        if(all(is.zero)) is.zero <- FALSE
        x[i, , drop=FALSE][which.max(x[i,time]*(!is.zero)),]
      }
      , m[1,], x=m, time=time
      )
    )
}

My original solution, which is a bit more readible than the previous one :

# Get the rating per belongID
getRating <- function(x){
  iszero <- x$Rating == 0
  if(all(iszero)){
    id <- which.max(x$Time)
  } else {
    id <- which.max((!iszero)*x$Time) 
            # This trick guarantees taking 0 into account
  }
  x[id,]
}
# Do this over the complete data frame
do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 
     # edited per Tyler's suggestion

The result :

tc <- textConnection('
belongID   uniqID   Time   Rating  
   1           101       5      0  
   1           102       4      0  
   2           103       4      0  
   2           104       3      0  
   2           105       2      5
   3           106       4      2  
   3           107       5      0  
   3           108       5      1 ')

Data <- read.table(tc,header=TRUE)

do.call(rbind,lapply(split(Data,Data$belongID),getRating)) 

to give :

  belongID uniqID Time Rating
1        1    101    5      0
2        2    105    2      5
3        3    108    5      1

EDIT : Just for fun, I did a benchmarking as well (using rbenchmark) on a small data set with 1000 replications, and a big one with 10 replications :

The outcome :

> benchmark(Joris(Data),Tyler(Data),uniqueMax(Data),
+           columns=c("test","elapsed","relative"),
+           replications=1000)
             test elapsed relative
1     Joris(Data)    1.20 1.025641
2     Tyler(Data)    1.42 1.213675
3 uniqueMax(Data)    1.17 1.000000

> benchmark(Joris(Data2),Tyler(Data2),uniqueMax(Data2),
+           columns=c("test","elapsed","relative"),
+           replications=10)
              test elapsed relative
1     Joris(Data2)    3.63 1.174757
2     Tyler(Data2)    4.02 1.300971
3 uniqueMax(Data2)    3.09 1.000000

Here I just wrapped a function Joris() and Tyler() around our solutions, and created Data2 as follows :

Data2 <- data.frame(
  belongID = rep(1:1000,each=10),
  uniqID = 1:10000,
  Time = sample(1:5,10000,TRUE),
  Rating = sample(0:5,10000,TRUE)
  )
于 2012-05-10T13:16:36.900 回答