r - 在 R 中加速大量 GLM

Question

首先，对不起，很长的帖子。认为最好提供上下文以获得好的答案（我希望！）。前段时间我写了一个 R 函数，它将获取数据框中变量的所有成对交互。这在当时工作得很好，但现在一位同事希望我用更大的数据集来做这件事。他们不知道最终会有多少变量，但他们猜测大约有 2,500 - 3,000 个。我下面的函数太慢了（100 个变量需要 4 分钟）。在这篇文章的底部，我列出了各种变量数量和交互总数的一些时间安排。我有在我的函数运行的 100 个变量上调用 Rprof() 的结果，所以如果有人想看看它，请告诉我。我不想让超长的时间超过它需要的时间。

我想知道的是我是否可以做些什么来加快这个功能。我尝试直接查看 glm.fit，但据我了解，为了有用，设计矩阵的计算以及坦率地说我不理解的所有其他东西，每个模型都需要相同，我的分析并非如此，尽管我对此可能错了。

任何有关如何使此运行更快的想法将不胜感激。我计划最终使用并行化来运行分析，但我不知道我可以访问多少个 CPU，但我会说不会超过 8 个。

在此先感谢，干杯
戴维。

getInteractions2 = function(data, fSNPcol, ccCol)
{
#fSNPcol is the number of the column that contains the first SNP
#ccCol is the number of the column that contains the outcome variable
  require(lmtest)
  a = data.frame()
  snps =  names(data)[-1:-(fSNPcol-1)]
  names(data)[ccCol] = "PHENOTYPE"
  terms = as.data.frame(t(combn(snps,2)))
  attach(data)

  fit1 = c()
  fit2 = c()
  pval  = c()

  for(i in 1:length(terms$V1))
  {
    fit1 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i])),family="binomial")
    fit2 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i]))+I(get(as.character(terms$V1[i]))*get(as.character(terms$V2[i]))),family="binomial")
    a  = lrtest(fit1, fit2)
    pval = c(pval, a[2,"Pr(>Chisq)"])
  }

  detach(data)
  results = cbind(terms,pval) 
  return(results)
}

下表是通过函数传递的变量数量增加的 system.time 结果。n 是数量，Ints 是由该数量的变量给出的成对交互的数量。

      n   Ints     user.self sys.self elapsed
time  10   45      1.20     0.00    1.30
time  15  105      3.40     0.00    3.43
time  20  190      6.62     0.00    6.85
... 
time  90 4005    178.04     0.07  195.84
time  95 4465    199.97     0.13  218.30
time 100 4950    221.15     0.08  242.18

如果您想查看时序或 Rprof() 结果，请使用一些代码来重现数据帧。请不要运行此程序，除非您的机器速度非常快，或者您准备等待大约 15-20 分钟。

df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
gtypes = matrix(nrow=2000, ncol=3000)
gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x})
snps = paste("rs", 1000:3999,sep="")
df = cbind(df,gtypes)
names(df) = c("sid", "status", snps)

times = c()
for(i in seq(10,100, by=5)){
  if(i==100){Rprof()}
  time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
  print(time)
  times = rbind(times, time)
  if(i==100){Rprof(NULL)}
}
numI = function(n){return(((n^2)-n)/2)}
timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)

score 0 · Accepted Answer

所以我已经解决了这个问题（在 R 邮件列表的帮助下），并将其发布，以防它对任何人都有用。

基本上，在 SNP 或变量是独立的（即不在 LD 中，不相关）的情况下，您可以将每个 SNP/变量居中，如下所示：

rs1cent <- rs1-mean(rs1)
rs2cent <- rs2 -mean(rs2)

然后，您可以测试表型和相互作用之间的相关性作为筛选步骤：

rs12interaction <- rs1cent*rs2cent
cor(PHENOTYPE, rs12interaction)

然后使用完整的 glm 对任何似乎相关的内容进行全面调查。与以往一样，截止选择是任意的。

其他建议是使用 RAO 分数测试，它只涉及拟合零假设模型，这将这一步的计算时间减半，但我并不真正理解这是如何工作的（但是！需要更多阅读。）

无论如何，你去。也许有一天对某人有用。

r - 在 R 中加速大量 GLM

1 回答 1

Related

Reference