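I have some code that lets me draw two random samples from a dataset, apply a function, and repeat the process a set number of times (see the code below, taken from a related question: How to bootstrap a function with replacement and return the output).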

Example data:

> dput(a)
structure(list(index = 1:30, val = c(14L, 22L, 1L, 25L, 3L, 34L, 
35L, 36L, 24L, 35L, 33L, 31L, 30L, 30L, 29L, 28L, 26L, 12L, 41L, 
36L, 32L, 37L, 56L, 34L, 23L, 24L, 28L, 22L, 10L, 19L), id = c(1L, 
2L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 16L, 17L, 18L, 19L, 20L, 21L, 21L, 22L, 23L, 24L, 
25L)), .Names = c("index", "val", "id"), class = "data.frame", row.names = c(NA, 
-30L))

Code:

    library(plyr)
    extractDiff <- function(P){
      subA <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a random sample of 15 rows
      subB <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a second random sample of 15 rows
      meanA <- mean(subA$val)
      meanB <- mean(subB$val)
      diff <- abs(meanA-meanB)
      outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
      return(outdf)
    }

    set.seed(42)
    fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))

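Instead of drawing two random samples of size 15, I would like to draw one random sample of size 15 and then take the 15 rows that remain in the dataset after that first draw (i.e. subA would be the first random sample of 15 observations, and subB would be the 15 observations left over once subA has been taken). I'm really not sure how to do this. Any help would be greatly appreciated. Thanks!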

2 Answers

In this case, I would simply shuffle the row numbers of P (stored in index below), then take the first 15 for subA and the next 15 for subB.

library(plyr)
extractDiff <- function(P){
  index <- sample(seq_len(nrow(P)), replace = FALSE) # shuffles the row numbers
  subA <- P[index[1:15], ] # takes the first 15 shuffled rows
  subB <- P[index[16:30], ] # takes the remaining 15 rows
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}

set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))
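
The function above hard-codes 15 and 30, so it only fits a 30-row data frame. A minimal sketch of the same shuffle idea for any data frame with at least two rows might look like the following. The helper name splitHalves and the toy data frame are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): split a data frame
# into two disjoint halves without hard-coding the sizes.
splitHalves <- function(P) {
  n <- nrow(P)
  index <- sample(seq_len(n))   # random permutation of the row numbers
  half <- floor(n / 2)          # size of the first half
  list(subA = P[index[seq_len(half)], , drop = FALSE],
       subB = P[index[(half + 1):n], , drop = FALSE])
}

# Toy data standing in for `a` (assumed values, for illustration only)
toy <- data.frame(index = 1:30, val = seq(2, 60, by = 2))

set.seed(42)
halves <- splitHalves(toy)

# No row appears in both halves, and together they cover every row.
stopifnot(length(intersect(halves$subA$index, halves$subB$index)) == 0)
stopifnot(nrow(halves$subA) + nrow(halves$subB) == nrow(toy))
```

With an odd number of rows, subB receives the extra row in this sketch.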
Answered 2014-06-25T18:55:46.497

I believe you can do this with a small change to your code.

extractDiff <- function(P){
  sampleset <- sample(nrow(P), 15, replace=FALSE) # selects 15 rows at random; note replace=FALSE
  subA <- P[sampleset, ] # takes the 15 selected rows
  subB <- P[-sampleset, ] # takes the remaining rows in the set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}
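
As a quick sanity check of the negative-index step, the sketch below confirms that P[-sampleset, ] returns exactly the rows not picked for subA. The toy data frame is an assumption used only for illustration.

```r
# Toy data standing in for `a` (assumed values, for illustration only)
toy <- data.frame(index = 1:30, val = seq(2, 60, by = 2))

set.seed(1)
sampleset <- sample(nrow(toy), 15, replace = FALSE)  # 15 distinct row numbers
subA <- toy[sampleset, ]    # the sampled rows
subB <- toy[-sampleset, ]   # every row that was not sampled

# The two halves are disjoint and have 15 rows each.
stopifnot(nrow(subA) == 15, nrow(subB) == 15)
stopifnot(length(intersect(subA$index, subB$index)) == 0)

# Because the halves are equal in size and cover the whole data set,
# the average of their means equals the overall mean.
stopifnot(isTRUE(all.equal((mean(subA$val) + mean(subB$val)) / 2,
                           mean(toy$val))))
```

One caveat: negative indexing with an empty vector selects no rows rather than all rows. That cannot happen here, because sampleset always holds 15 values.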

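Note, however, that this is not compatible with the bootstrap, because the bootstrap requires sampling with replacement. On the other hand, if you want to sample with replacement from the dataset, and then sample with replacement from the rows that were not selected in the first sample, you can do the following.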

extractDiff <- function(P){
  sampleset1 <- sample(nrow(P), 15, replace=TRUE) # selects 15 rows at random; note replace=TRUE
  sampleset2 <- sample((1:nrow(P))[-unique(sampleset1)], 15, replace=TRUE) # selects only from rows not used in sampleset1
  subA <- P[sampleset1, ] # takes the 15 selected rows
  subB <- P[sampleset2, ] # takes the 15 rows selected from the remaining set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}

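However, depending on your application, this may still not be ideal, because the second dataset is more likely than the first to contain multiple instances of the same value. If you select a smaller proportion of the total group, the problem will be much smaller. You might be better off using the "shuffle" to split the set in two and then sampling with replacement from each half. The two groups would then be more even, but once again the first group would no longer be a true bootstrap sample.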
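
The shuffle-then-resample idea described above could be sketched roughly as follows. The function name extractDiffSplit and the toy data frame are assumptions made for illustration. This is one possible reading of the suggestion and not code from the original answer.

```r
# Hypothetical variant of extractDiff: shuffle the rows, split them into two
# disjoint halves, then sample with replacement within each half.
extractDiffSplit <- function(P) {
  n <- nrow(P)
  index <- sample(seq_len(n))        # shuffle the row numbers
  half <- floor(n / 2)
  rowsA <- index[seq_len(half)]      # first half of the shuffled rows
  rowsB <- index[(half + 1):n]       # remaining rows
  # Resample by position within each half. sample.int() avoids the special
  # case where sample() expands a length-one numeric vector into 1:x.
  bootA <- rowsA[sample.int(length(rowsA), replace = TRUE)]
  bootB <- rowsB[sample.int(length(rowsB), replace = TRUE)]
  meanA <- mean(P$val[bootA])
  meanB <- mean(P$val[bootB])
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

# Toy data standing in for `a` (assumed values, for illustration only)
toy <- data.frame(index = 1:30, val = seq(2, 60, by = 2))

set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiffSplit(toy), simplify = FALSE))
```

Both halves are treated identically here, so neither is more prone to repeated rows than the other. The cost is that neither group is a bootstrap sample of the full data set.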

Answered 2014-06-25T19:21:41.970