r - Bootstrap resampling for hierarchical/multilevel data

Question

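I am trying to do bootstrap resampling on a multilevel/hierarchical dataset. The observations are (unique) patients clustered within hospitals.

My strategy is to sample with replacement from the patients within each hospital in turn, which ensures that every hospital is represented in the sample and that, when repeated, all samples are the same size. This is method 2 here.

My code looks like this: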

hv <- na.omit(unique(dt$hospital))

samp.out <- NULL

for (hosp in hv ) {
    ss1 <- dt[dt$hospital==hosp & !is.na(dt$hospital),]
    ss2 <- ss1[sample(1:nrow(ss1),nrow(ss1), replace=T),]
    samp.out <- rbind(samp.out,ss2)
}

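This seems to work (although if anyone can see any problems, I would be grateful).

The problem is that it is slow, so I would like to know whether there is a way to speed it up.

Update:

I tried to implement Ari B. Friedman's answer but had no success, so I modified it slightly, with the aim of building a vector and then indexing the original data frame. Here is my new code: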

# this is a vector that will hold unique IDs
v.samp <- rep(NA, nrow(dt))

#entry to fill next
i <- 1

for (hosp in hv ) {
    ss1 <- dt[dt$hospital==hosp & !is.na(dt$hospital),]

    # column 1 contains a unique ID
    ss2 <- ss1[sample(1:nrow(ss1),nrow(ss1), replace=T),1]
    N.fill <- length(ss2)
    v.samp[ seq(i,i+N.fill-1) ] <- ss2

    # update entry to fill next
    i <- i + N.fill
}

samp.out <- dt[dt$unid %in% v.samp,]

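This is fast! However, it does not work properly, because the last line only selects the unique IDs in v.samp, whereas the sampling is with replacement, so v.samp contains repeated IDs. Any further help would be much appreciated.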

score 2 · Accepted Answer

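The usual trick for speeding up bootstraps is to draw the entire sample (all replicates) at once for each hospital, then assign the draws to replicates. That way you only run your ss1 <- line once per hospital. You can probably improve on that by not subsetting for each hospital at all. The other huge win would likely come from pre-allocating rather than rbinding. More suggestions for speed improvements.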
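A minimal sketch of the "draw all replicates at once" idea, assuming a toy data frame shaped like dt in the question. The toy data, n_rep, rows_by_hosp and boot_idx are illustrative names, not from the original code. It samples row indices rather than rows, so dt is never subset inside a loop:

```r
set.seed(1)

# Hypothetical toy data: 3 hospitals with 4, 3 and 5 patients
dt <- data.frame(unid = 1:12,
                 hospital = rep(c("A", "B", "C"), times = c(4, 3, 5)))
n_rep <- 100L  # number of bootstrap replicates (assumed)

# Row indices of dt grouped by hospital; split() drops rows with NA hospital
rows_by_hosp <- split(seq_len(nrow(dt)), dt$hospital)

# For each hospital, draw n * n_rep indices in one call and shape them into
# an n x n_rep matrix: column r holds that hospital's draw for replicate r
draws <- lapply(rows_by_hosp, function(idx) {
  n <- length(idx)
  matrix(idx[sample.int(n, n * n_rep, replace = TRUE)], nrow = n, ncol = n_rep)
})

# Stack the hospitals: each column is now one complete replicate
boot_idx <- do.call(rbind, draws)

# Materialise a replicate only when it is needed
rep1 <- dt[boot_idx[, 1], ]
```

Each column of boot_idx has the same hospital sizes as the original data, so a statistic can be computed per column with something like apply(boot_idx, 2, ...).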

To pre-allocate, calculate how many entries you will need (call it N.out). Then, just before your loop, add:

samp.out <- rep(NA, N.out)

and replace your rbind line with:

samp.out[ seq(i,i+N.iter) ] <- ss2

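where i is the first entry that has not yet been filled, and i+N.iter is the last entry you have data to fill on this round (so N.iter is one less than the number of values being written).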
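Put together, the pre-allocation pattern might look like the sketch below. It assumes the vector stores row numbers of dt rather than whole rows; the toy data and the names samp.idx, rows and n are illustrative only. seq(i, length.out = n) is used so the slice has exactly n entries and the off-by-one question never comes up:

```r
set.seed(1)

# Hypothetical toy data; the last patient has no hospital recorded
dt <- data.frame(unid = 101:112,
                 hospital = c(rep(c("A", "B", "C"), times = c(4, 3, 4)), NA))
hv <- na.omit(unique(dt$hospital))

# One entry per row with a known hospital, allocated once up front
N.out <- sum(!is.na(dt$hospital))
samp.idx <- rep(NA_integer_, N.out)

i <- 1  # first entry not yet filled
for (hosp in hv) {
  rows <- which(dt$hospital == hosp)  # which() ignores NA comparisons
  n <- length(rows)
  samp.idx[seq(i, length.out = n)] <- rows[sample.int(n, n, replace = TRUE)]
  i <- i + n
}

# A single indexing step replaces the repeated rbind calls
samp.out <- dt[samp.idx, ]
```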

See the R Inferno for more details and tricks.

Update

You have two approaches and you're mixing them. You can either make v.samp a data.frame and just sample all the rows into it in real-time, or you can sample IDs, and then select a data.frame using the vector of IDs outside of the loop. The key to the latter is that myDF[c(1,1,5,2,3),] will give you a data.frame which repeats the first row--exactly what you want, and exactly what that feature was designed for. Make sure v.samp is an ID that you can select from a data.frame on (either a row number or a row name), then select outside the loop.
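To illustrate the second approach on a small scale, here is a sketch with made-up data (the five-row dt and the values in v.samp are hypothetical). It shows why %in% loses the repeats and how match() can turn sampled IDs into row positions that keep them:

```r
# Hypothetical toy data and a draw with replacement (ID 11 appears twice)
dt <- data.frame(unid = c(11, 12, 13, 14, 15),
                 x    = c("a", "b", "c", "d", "e"))
v.samp <- c(11, 11, 15, 12, 13)

# %in% only tests membership, so the repeated ID collapses to one row
n_in <- nrow(dt[dt$unid %in% v.samp, ])   # 4

# match() gives the row position of each sampled ID, repeats included;
# here it returns c(1, 1, 5, 2, 3)
pos <- match(v.samp, dt$unid)
samp.out <- dt[pos, ]
nrow(samp.out)                            # 5
```

If v.samp holds row numbers in the first place, the match() step is unnecessary and dt[v.samp, ] does the job directly.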
