r - R 模拟二项式多级数据以估计各种 ICC 的功率

Question

提前我想为这篇文章的长度道歉：S [编辑如下]

我想为具有二元结果的集群随机试验进行样本量计算。结果是患者是否使用帮助服务；Y= 使用帮助服务（Y/N）患者嵌套在诊所（五种类型）中，诊所嵌套在医院中。

我们有一个针对医院工作人员的教育计划，以促进帮助服务，试验要回答的主要问题是“教育计划是否会增加患者在治疗后的前 3 个月使用帮助热线的比例？完成教育计划？该教育计划将在“实验医院”中展开；控制医院不会做任何事情。

并非所有患者都会使用该服务，因为并非所有患者在与临床医生/护士交谈后都会有未满足的需求。不同诊所的未满足需求水平被认为是不同的（提供更严重治疗的诊所有更高的未满足需求，因此有更大的“潜在求助热线用户”群体）。

在控制站点使用帮助服务的患者的预期百分比 = 10% 在实验站点使用帮助服务的患者的预期百分比 = 20%

组内系数未知，但其他研究显示 ICC 分别为 0.09 和 0.15。我有兴趣在 0 到 0.20 的 ICC 范围内评估样本量。

目前，我假设教育计划可以立即在所有随机接受实验治疗的医院推出，并且患者只访问 1 个诊所。可用于评估计划有效性的数据是：• 三个月内每个诊所就诊的患者人数 • 三个月内使用帮助服务的患者人数我们没有个别患者数据（因为有些人在使用帮助服务时可能希望保持匿名），但汇总数据在诊所层面。

我找到了 R 包“CRTSize”，它可以计算两级数据（医院内的患者）所需的集群数量：

library(CRTSize)

pc.=0.10
pe.=0.20
ICCsteps <- c(0.0000001, 0.05,0.10, 0.15)
Msteps=c(25, 50,100,200,300)

results <- data.frame(pe=c(), pc=c(), m=c(), ICC=c(), Clusters_contr=c(),  Clusters_exp=c())

 for(i in seq_along(ICCsteps)){

for(m in seq_along(Msteps)){
print(paste("ICC=",ICCsteps[i], "m=", Msteps[m]))

mm <- n4props(pe=pe., pc=pc., m=Msteps[m], ICC=ICCsteps[i], alpha=0.05, power = 0.80, AR=1, two.tailed=TRUE, digits=3)

results_temp <- data.frame(pe=pe., pc=pc., m=mm$m, ICC=mm$ICC, Clusters_contr=ceiling(mm$nC), Clusters_exp=ceiling(mm$nE))
results <- rbind(results, results_temp)
  }
}

results
results$TotalClustersNeeded <- results$Clusters_contr +  results$Clusters_exp

par(mar=c(4,4,3,4))
plot(results$m, results$TotalClustersNeeded, type="n", xlab="Number of samples within cluster", ylab="Total number of clusters needed (half control/half intervention)", las=1);grid()
title(paste("Total # clusters needed to have 80% power with \n alfa 0.05 to show a difference between", pc.*100, "% vs. ", pe.*100, "%"))
for(i in seq_along(ICCsteps)){
points(results$m[results$ICC ==ICCsteps[i]], results$TotalClustersNeeded[results$ICC ==ICCsteps[i]], type="l")
text(max(Msteps)*0.90, min(results$TotalClustersNeeded[results$ICC ==ICCsteps[i]])+3, paste("ICC=", ICCsteps[i]), las=1)
}

所以当我每家医院有大约 200 分，ICC 为 0.15 时，我需要大约 65 家医院（一半随机对照，一半随机干预）

我不认为它可以对我的数据中的三个级别（pts/clinics/hospitals）进行建模。其他并发症将是医院的患者人数会有所不同，并且并非所有医院都会拥有所有类型的诊所，因为有些医院可能更专业。

所以我试图模拟一些数据。我不知道如何使用特定的 ICC 模拟数据，但我想我找到了解决方法（见下文）。每个对照医院的患者有 10% 的概率拨打求助热线（干预医院为 20%）。我围绕这些百分比添加了医院特定的错误。每家医院都有 1-4 个诊所。用于从二项分布中抽取 x 个样本的比例在医院内的诊所之间是相同的（x 是诊所中的患者人数）。

我以 ICC 约为 0.15 的方式调整了医院误差。但是，ICC 因每个模拟而异。我想，当我运行许多模拟时，我会在 0.145-0.155 范围内得到足够的模拟来说明 ICC = 0.15 的功率。

这是我到目前为止所做的：

 library(reshape2)
    library(lme4)
    library(plyr)
# Parameters
hospitals <- 30;h=1 # I set it up this way to be able to loop over several sample sizes, however, at the moment i am not using it. 

clinics <- data.frame(clinic =c("CT", "RT", "Surg", "Pal"),
                      prop.clin.hosp   = c(0.70, 0.80, 0.50  , 0.40),
                      Unmet.needs = c("45", "30", "20"  , "10" ), # unmet needs currently not included in simulation; assumed to be equal between clinics
                      prop.pts    = c(0.35, 0.35, 0.20  , 0.10)); clinics # most pts go through CT and RT, least through palliative care

prop.Control <- 0.5 # Proportion of hospitals allocated to control
Min.Pts.Treated.Per.Clinic <- 50 # No pts for full 6 month period for comparison (actually, its per hospital)
Max.Pts.Treated.Per.Clinic <- 300 # No pts for full 6 month period for comparison (actually, its per hospital)

Prpo_pts_calls_CONTROL <- 0.10
Prpo_pts_calls_INTERVENTION <-  0.20

N_simulations <- 100  #

Hosp_error_min <- -0.12
Hosp_error_max <- abs(Hosp_error_min)



# Empty dataframe to fill
results <- data.frame(N=c(), 
                      mean.pt.per.hosp=c(), 
                      pts_contr=c(), 
                      pts_exp=c(), 
                      total.var=c(),
                      Sf=c(),
                      Sl=c(),
                      Se=c(),
                      Sd=c(),
                      ICC=c(), 
                      p=c())


for(i in 1:N_simulations){
  cat("\r",  "Run =", i, "of", N_simulations)
# Data generation
  data<- data.frame(Hosp=c(1:hospitals[h]), 
                    CT=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "CT"]) , 
                    RT=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "RT"]) , 
                    Surg=rbinom(hospitals[h],1,clinics$prop.clin.hosp[clinics$clinic== "Surg"]) , 
                    Pal=rbinom(hospitals[h],1,clinics$prop.clin.hosp

    [clinics$clinic== "Pal"]), 
                        hosp_err= runif(hospitals[h], Hosp_error_min, Hosp_error_max))


  data$Arm <- c(rep("Control", round(hospitals[h]*prop.Control, 0)), rep("Intervention", hospitals[h] - round(hospitals[h]*prop.Control, 0))) 
  data$Pts=round(runif(hospitals[h], Min.Pts.Treated.Per.Clinic, Max.Pts.Treated.Per.Clinic), 0)
  data$CT[(data$CT+data$RT+data$Surg+data$Pal)==0] <- 1 # If no clinics assigned, assign CT
  data <- melt(data, id=c("Hosp","Arm", "Pts", "hosp_err"), variable.name="clinic")
  data <- subset(data, value==1); data <- data[-dim(data)[2]]; 
  data <- merge(data, clinics[c("clinic", "Unmet.needs", "prop.pts")], by="clinic")
  data$Pt.per.clin <- ceiling(data$Pts * data$prop.pts)

  data$simpropcalls[data$Arm == "Control"] <- Prpo_pts_calls_CONTROL
  data$simpropcalls[data$Arm == "Intervention"] <- Prpo_pts_calls_INTERVENTION
  data$simpropcalls2 <- data$simpropcalls + data$hosp_err; data$simpropcalls2[data$simpropcalls2 <0] <- 0 # Add error to introduce between hospital variation

  for(row in c(1: dim(data)[1])){data$Pts.called[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls2[row])}
#  for(row in c(1: dim(data)[1])){data$Pts.called2[row]<- rbinom(1, data$Pt.per.clin[row], data$simpropcalls[row])}
  data$propcalls <- round(data$Pts.called/data$Pt.per.clin , 2)

  data <- data[with(data, order(data$Hosp)), ]
  head(data, 20)

# Analysis and extraction of results
  summary(m1 <- glmer(cbind(Pts.called, Pt.per.clin-Pts.called) ~ Arm + (1|Hosp), family=binomial, data=data))

# Few lines to estimate ICC
  X <- model.matrix(m1)
  n <- nrow(X)
  Beta <- fixef(m1)
  Sf <- var(X %*% Beta) # Variance explained by fixed effect
  (Sigma.list <- VarCorr(m1))
  Sl <- as.numeric(VarCorr(m1)) # Between subject variance
  Se <- 1
  Sd <- pi^2/3
  total.var <- Sf + Sl + Se + Sd
  (ICCtemp <- Sl/total.var)

#Store results
results.temp <- data.frame(N=hospitals[h], 
                             mean.pt.per.hosp=mean(ddply(data, .(Hosp), summarize, mean.hosp.size = sum(Pt.per.clin))[[2]]),
                             pts_contr=sum(data$Pt.per.clin[data$Arm == "Control"]), 
                             pts_exp=sum(data$Pt.per.clin[data$Arm == "Intervention"]), 
                             total.var=total.var,
                             Sf=Sf,
                             Sl=Sl,
                             Se=Se,
                             Sd=Sd,
                             ICC=as.numeric(ICCtemp), 
                             p=summary(m1)$coefficients[2,4])
  results <- rbind(results, results.temp)

   } # END LOOP


# After running the code above 100s of times, 

hist(results$ICC)
results$ICC.class <- cut(results$ICC, seq(0, max(results$ICC), 0.01))
results$sign <- ifelse(results$p <0.05, 1, 0)

res <- cbind(data.frame(round(table(results$ICC.class[results$sign == 1])/table(results$ICC.class)*100, 2)),
             data.frame(table(results$ICC.class))[2]); names(res) <- c("ICC", "Power", "similations"); res # Obvisously only works with high number of simulations
# write.csv(res, paste("_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))
# write.csv(results, paste("_RAW DATA_Power calculation_", Prpo_pts_calls_CONTROL*100, "% vs", Prpo_pts_calls_INTERVENTION*100, "%_",hospitals[h],"hosp_hosp_err",Hosp_error_max, ".csv"))

当我运行 10k 次模拟时，我得到以下（意外）结果：

（这是针对 10% 对 20%，N=30 家医院（15+15），医院与目标比例的具体偏差为 -0.15 到 0.15（从均匀分布中抽样）

ICC Power   simulations
(0,0.01]    NA  0
(0.01,0.02] 100 1
(0.02,0.03] 100 3
(0.03,0.04] 62.5    16
(0.04,0.05] 66.67   45
(0.05,0.06] 79.63   108
(0.06,0.07] 81.37   161
(0.07,0.08] 75.46   269
(0.08,0.09] 82.02   356
(0.09,0.1]  79.31   464
(0.1,0.11]  80.47   558
(0.11,0.12] 82.58   706
(0.12,0.13] 82.13   705
(0.13,0.14] 83.44   779
(0.14,0.15] 83.25   794
(0.15,0.16] 85.37   752
(0.16,0.17] 85.79   760
(0.17,0.18] 88.64   678
(0.18,0.19] 92.78   609
(0.19,0.2]  88.81   536
(0.2,0.21]  88.31   402
(0.21,0.22] 87.98   341
(0.22,0.23] 92.94   255
(0.23,0.24] 93.07   202
(0.24,0.25] 86.96   138
(0.25,0.26] 92.44   119
(0.26,0.27] 87.36   87
(0.27,0.28] 96.77   62
(0.28,0.29] 97.78   45
(0.29,0.3]  90  20
(0.3,0.31]  100 9
(0.31,0.32] 100 6
(0.32,0.33] 88.89   9
(0.33,0.34] 100 1

“功率”仅在明显运行许多模拟的范围内有效。例如，我们看到对于大约 0.15 的 ICC，当总共使用 30 家医院时（15 控制 +15 exp），我们将有大约 83-85% 的能力来检测控制/实验医院之间的差异。一件很清楚的事情是，功率不会随着 ICC 的增加而降低，这是我们所期望的。所以编码有问题……这与计算 ICC 的方式有关吗？ICC 的增加要么是医院间差异的增加，要么是医院内差异的减少，或两者兼而有之。

不知道如何进行……欢迎评论！

编辑 1：我想看看观察到的ICC内的力量并不公平。我应该看看真正的ICC下的力量。我想我最大的问题是如何使用固定的底层 ICC 模拟数据集......

r - R 模拟二项式多级数据以估计各种 ICC 的功率

0 回答 0

Related

Reference