I am working on a large dataset and have run into a data-cleaning problem. My dataset looks like this:

data <- cbind (group = c(1,1,1,2,2,3,3,3,4,4,4,4,4), 
               member = c(1,2,3,1,2,1,2,3,1,2,3,4,5), 
               score = c(0,1,0,0,0,1,0,1,0,1,1,1,0)) 

I want to keep every group whose score sum equals 1 and remove every group whose score sum equals 0. For a group whose score sum is greater than 1 (e.g., sum of score = 3), I want to randomly select members with score equal to 1 (two of them in that example) and remove them from the group, so that only one such member remains. The data might then look like this:

newdata <- cbind (group = c(1,1,1,3,3,4,4,4), 
                  member = c(1,2,3,2,3,1,3,5), 
                  score = c(0,1,0,0,1,0,1,0)) 

Can anybody help me get this done?

3 Answers

I would write a function that combines the various operations. Here is one such function, heavily commented:

process <- function(x) {
    ## this adds a vector with the group sum score
    x <- within(x, sumScore <- ave(score, group, FUN = sum))
    ## drop the groups with sumScore == 0 (logical indexing, so nothing
    ## is lost when no group has a zero sum)
    x <- x[x$sumScore != 0, , drop = FALSE]
    ## choose groups with sumScore > 1
    ## sample sumScore - 1 of the rows where score == 1L
    foo <- function(x) {
        scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
        ## which of the group's observations have score == 1L
        want <- which(x$score == 1L)
        ## sample all but one of these for removal
        want <- sample(want, scr-1)
        ## remove the selected rows & return
        x[-want, , drop = FALSE]
    }
    ## logical flag: which rows belong to groups with sumScore > 1
    want <- x$sumScore > 1L
    ## select only those rows, split them by group, lapply foo
    ## to each group, then rbind the resulting data frames together
    newX <- do.call(rbind,
                    lapply(split(x[want, , drop = FALSE], x[want, "group"]),
                           FUN = foo))
    ## bind the sampled sumScore > 1L groups on to the rest of x;
    ## logical indexing avoids x[-integer(0), ] dropping every row
    newX <- rbind(x[!want, , drop = FALSE], newX)
    ## remove row labels
    rownames(newX) <- NULL
    ## return the data without the sumScore column
    newX[, 1:3]
}

With your data:

dat <- data.frame(group = c(1,1,1,2,2,3,3,3,4,4,4,4,4), 
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5), 
                  score = c(0,1,0,0,0,1,0,1,0,1,1,1,0)) 

this gives:

> set.seed(42)
> process(dat)
  group member score
1     1      1     0
2     1      2     1
3     1      3     0
4     3      1     1
5     3      2     0
6     4      1     0
7     4      3     1
8     4      5     0

which is what I think you wanted.

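Update: in process() above, the internal function foo() could be rewritten to sample just 1 of the score == 1L rows and drop the others, i.e. replace foo() with the version below: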

foo <- function(x) {
    ## which of the group's observations have score == 1L
    want <- which(x$score == 1L)
    ## want to sample just one of these to keep
    want <- sample(want, 1)
    ## return the score-0 rows plus the one selected score-1 row
    x[sort(c(which(x$score != 1L), want)), , drop = FALSE]
}

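The two are essentially the same operation, but having foo() select just 1 row makes the intended behaviour explicit: we want to randomly select 1 row from those with score == 1L, rather than sample scr-1 values.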
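
For comparison, the "keep exactly one score-1 row per group" idea can also be sketched as a single standalone function that does not need the sumScore helper column. This is only an illustration, not part of the answer above: the name keep_one_per_group is hypothetical, and the sketch assumes that score only ever takes the values 0 and 1.

```r
# Hypothetical helper (not from the answer above): keep every score-0 row plus
# one randomly chosen score-1 row per group, and drop groups with no score-1 row.
# Assumes score only takes the values 0 and 1.
keep_one_per_group <- function(x) {
  pieces <- lapply(split(x, x$group), function(g) {
    ones <- which(g$score == 1)
    if (length(ones) == 0L) return(NULL)        # group sum is 0: drop the group
    # index into `ones` so a single candidate is handled correctly
    keep <- ones[sample.int(length(ones), 1L)]
    g[sort(c(which(g$score != 1), keep)), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)                 # NULL pieces are skipped by rbind
  if (!is.null(out)) rownames(out) <- NULL
  out
}

dat <- data.frame(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
                  score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
res <- keep_one_per_group(dat)
```

Which score-1 member survives in groups 3 and 4 depends on the random seed, but every remaining group should end up with a score sum of exactly 1.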

answered 2012-06-06T15:42:38.513
I would define a function that does what you want, then use ddply to split by group:

myfun <- function(x) {
  if(sum(x$score)==1) {
    return(x)
  } else if(sum(x$score)==0) {
    return(data.frame())
  } else {
    row.names(x) <- NULL
    score.1 <- sample(as.integer(row.names(x[x$score==1,])), nrow(x[x$score==1,])-1)
    return(x[-score.1,])
  }
}

library(plyr)
ddply(as.data.frame(dat), .(group), myfun)

  group member score
1     1      1     0
2     1      2     1
3     1      3     0
4     3      1     1
5     4      1     0
6     4      2     1
7     4      3     1
answered 2012-06-06T15:25:53.960
ugroups<-unique(data[,1])
scores<-sapply(ugroups,function(x){sum(data[,1]==x & data[,3]==1)})
data[data[,1]%in%ugroups[scores>0],]
....... etc

will give you the cumulative score for each group, and so on.
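
As a hedged variant of the same first step, the per-group sums can also be computed with tapply() on the matrix from the question. The variable names group_sums, keep_groups and kept are made up for this sketch, and it only covers dropping the zero-sum groups, not the random removal.

```r
# Matrix from the question
data <- cbind(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
              member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
              score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

# Sum of score within each group; the result is named by group id
group_sums <- tapply(data[, "score"], data[, "group"], sum)

# Group ids whose sum is greater than 0
keep_groups <- as.numeric(names(group_sums)[group_sums > 0])

# Keep only the rows that belong to those groups
kept <- data[data[, "group"] %in% keep_groups, , drop = FALSE]
```

Here group_sums is 1, 0, 2, 3 for groups 1 to 4, so kept drops the two rows of group 2.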

answered 2012-06-06T15:27:12.830