r - Simulating datasets for bootstrap analysis

Question

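My goal is to use bootstrapping (1000 replicates) to calculate the null distribution, mean, and CI of r (Pearson's correlation coefficient) for a trait (x) among 20 simulated random pairs generated from my dataset of 600 unique individuals (ID). I recently switched from SAS to R; in SAS I would have used "proc surveyselect" to generate the datasets. Questions:

What is the most efficient way to produce these results (see my attempt below)?
How would I use the set.seed command in my example so that I can replicate my results?

Simulated starting dataset with 600 individuals and their associated trait values: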

ID <- seq(1, 600, by = 1)
x <- rnorm(600, mean = 7, sd = 2)
X <- as.data.frame(cbind(ID, x))

I then generate 1000 replicates of r and compute the 95% CI:

Z <- numeric(1000)  # preallocate the vector of r estimates
for (i in 1:1000) {
  X.sample <- X[ sample(1:nrow(X), 40, replace = FALSE), ] 
  X.sample.1 <- X.sample[1:20, ]
  X.sample.2 <- X.sample[21:40, ]
  Y <- as.data.frame(cbind(X.sample.1$ID, X.sample.1$x, X.sample.2$ID,  X.sample.2$x))
  cor.results <- cor.test(Y[,2], Y[,4], alternative = c("greater"), method = c("pearson"))
  Z[i] <- cor.results$estimate
}

error <- qt(0.975, df = (length(Z) - 1)) * (sd(Z))/sqrt(length(Z))

score 1 · Accepted Answer

Try this on for size:

# generate dataset
set.seed(1)
X <- rnorm(600, 7, 2)

# Create a function that samples 40 elements from X,
#  and calculates Pearson's r for the first 20 elements 
#  against the last 20 elements.
booties <- function(x) {
  X.samp <- sample(x, 40)
  cor(X.samp[1:20], X.samp[21:40])
}

# Replicate this function 1000 times (spits out a vector of cor estimates)
Z <- replicate(1000, booties(X))
error <- qt(0.975, df = length(Z) - 1) * sd(Z) / sqrt(length(Z))

The 1000 replicates take about 0.08 seconds to complete (roughly an order of magnitude faster than the for loop you were experimenting with).
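Because Z holds the simulated null distribution, its mean and an interval can be read off it directly. The following is a minimal sketch rather than part of the answer above. It assumes a percentile interval from quantile() is what is wanted for the null distribution, whereas the qt()-based margin of error describes the uncertainty in the mean of Z. The names null_mean, null_ci and mean_moe are illustrative only.

```r
set.seed(1)                       # fix the RNG state so the run is repeatable
X <- rnorm(600, mean = 7, sd = 2)

# One replicate: draw 40 distinct values, correlate the first 20 with the last 20
booties <- function(x) {
  X.samp <- sample(x, 40)
  cor(X.samp[1:20], X.samp[21:40])
}

Z <- replicate(1000, booties(X))

null_mean <- mean(Z)                        # centre of the null distribution
null_ci   <- quantile(Z, c(0.025, 0.975))   # central 95% of the simulated r values
# Margin of error for the mean of Z (a statement about the mean, not about single r values)
mean_moe  <- qt(0.975, df = length(Z) - 1) * sd(Z) / sqrt(length(Z))
```

The percentile interval is usually much wider than null_mean plus or minus mean_moe, since the latter shrinks with the number of replicates while the former does not.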

score 0 · Accepted Answer

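In general, implicit loops are faster than explicit loops. Try taking the code inside your loop, putting it in a function, and then using that function in an lapply or sapply statement.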

myfunction = function(<insert relevant parameters here>)
{ 
  X.sample <- X[ sample(1:nrow(X), 40, replace = FALSE), ] 
  X.sample.1 <- X.sample[1:20, ]
  X.sample.2 <- X.sample[21:40, ]
  Y <- as.data.frame(cbind(X.sample.1$ID, X.sample.1$x, X.sample.2$ID,  X.sample.2$x))
  cor.results <- cor.test(Y[,2], Y[,4], alternative = c("greater"), method = c("pearson"))
  cor.results$estimate
}

Z  = sapply(x, myfunction)
#Here every element of x contains the arguments you want to pass to my function
#You can pass multiple arguments separated by commas after the function name

error <- qt(0.975, df = (length(Z) - 1)) * (sd(Z))/sqrt(length(Z))
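One way the placeholder might be filled in, as a sketch under a few assumptions: the function takes a replicate index i (unused, present only because sapply passes each element of its first argument) plus the data frame. The names one_replicate, dat and i are hypothetical, and cor() stands in for cor.test() because only the estimate is kept.

```r
set.seed(42)
X <- data.frame(ID = 1:600, x = rnorm(600, mean = 7, sd = 2))

# i is ignored; it exists so that sapply() can call the function once per replicate
one_replicate <- function(i, dat) {
  rows   <- sample(nrow(dat), 40, replace = FALSE)  # 40 distinct individuals
  first  <- dat$x[rows[1:20]]                       # trait values for one member of each pair
  second <- dat$x[rows[21:40]]                      # trait values for the other member
  cor(first, second, method = "pearson")
}

Z <- sapply(1:1000, one_replicate, dat = X)
error <- qt(0.975, df = length(Z) - 1) * sd(Z) / sqrt(length(Z))
```

Extra arguments such as dat = X go after the function name in the sapply call and are passed unchanged to every invocation.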

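You can do it that way, but I find that, when you can, it is better to just use the function boot() from the boot package.

As for set.seed(), call it immediately before the random-number generation you want to reproduce; every later draw then follows deterministically from that seed. See below.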

> rnorm(6)
[1]  1.0915017 -0.6229437 -0.9074604 -1.5937133  0.3026445  1.6343924
> set.seed(1001)
> rnorm(6)
[1]  2.1886481 -0.1775473 -0.1852753 -2.5065362 -0.5573113 -0.1435595
> set.seed(1001)
> rnorm(6)
[1]  2.1886481 -0.1775473 -0.1852753 -2.5065362 -0.5573113 -0.1435595
> rnorm(6)
[1]  1.0915017 -0.6229437 -0.9074604 -1.5937133  0.3026445  1.6343924


> set.seed(1001)
> sample(1:5,10,replace=T)
 [1] 5 3 3 3 3 5 1 1 2 4
> sample(1:5,10,replace=T)
 [1] 3 1 5 3 2 5 1 2 1 4
> set.seed(1001)
> sample(1:5,10,replace=T)
 [1] 5 3 3 3 3 5 1 1 2 4
> rnorm(6)
[1] -0.1435595  1.0915017 -0.6229437 -0.9074604 -1.5937133  0.3026445
> set.seed(1001)
> rnorm(6)
[1]  2.1886481 -0.1775473 -0.1852753 -2.5065362 -0.5573113 -0.1435595
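Applied to the simulation in the question, a single set.seed() call ahead of the data generation and resampling is enough to make the entire run repeat. A sketch, where run_sim is a hypothetical wrapper introduced only for this illustration:

```r
# Wrap the whole pipeline so the seed is set once, right before any random draw
run_sim <- function(seed) {
  set.seed(seed)
  X <- rnorm(600, mean = 7, sd = 2)
  replicate(1000, {
    s <- sample(X, 40)          # 40 distinct values
    cor(s[1:20], s[21:40])      # r for 20 random pairs
  })
}

Z1 <- run_sim(1001)
Z2 <- run_sim(1001)   # same seed as Z1
Z3 <- run_sim(2002)   # different seed

identical(Z1, Z2)     # compares the two same-seed runs element by element
identical(Z1, Z3)     # compares runs started from different seeds
```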

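Hope that helps!

While working up an example with the boot function for you, I hit a snag: it only returns one row. Strange! I may post a new question about that. In any case, I think the function bootstrap() from the bootstrap package will meet your needs. Here is my example: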

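library(bootstrap)  # provides bootstrap()

set.seed(1001)
X <- rnorm(600, 7, 2)

# Draw 2 * pairs distinct positions and correlate the first half with the second half.
# Note: the values are read from the global X; the argument x only supplies the length.
myStat <- function(x, pairs) {
  index <- sample(1:length(x), pairs * 2)
  half <- length(index) / 2
  Z <- cor(X[index[1:half]], X[index[(half + 1):length(index)]])
  return(Z)
}

b <- bootstrap(X, 1000, myStat, pairs = 20)
Z <- b$thetastar
error <- qt(0.975, df = length(Z) - 1) * sd(Z) / sqrt(length(Z))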
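For comparison, here is a sketch of how the same resampling might be expressed with boot() from the boot package. It relies on boot's sim = "permutation" option to draw without replacement, which is a choice made for this illustration and not something the answer above used; pair_cor is a hypothetical name.

```r
library(boot)   # recommended package, normally installed alongside R

set.seed(1001)
X <- rnorm(600, 7, 2)

# boot() calls the statistic with the data and a vector of indices.
# With sim = "permutation" the indices are a reshuffle of 1:600,
# so the first 40 of them form a sample without replacement.
pair_cor <- function(data, idx) {
  s <- data[idx[1:40]]
  cor(s[1:20], s[21:40])
}

b <- boot(X, pair_cor, R = 1000, sim = "permutation")
Z <- as.vector(b$t)   # b$t is a 1000 x 1 matrix of replicate estimates
error <- qt(0.975, df = length(Z) - 1) * sd(Z) / sqrt(length(Z))
```

The replicate estimates live in b$t, while b$t0 holds the statistic computed on the data in its original order.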
