r - 将 t.test 应用于大型矩阵的每一列的最快方法是什么？

Question

假设我有一个大矩阵：

M <- matrix(rnorm(1e7),nrow=20)

进一步假设每一列代表一个样本。假设我想申请t.test()每一列，有没有比使用快得多的方法apply()？

apply(M, 2, t.test)

在我的电脑上运行分析只用了不到 2 分钟：

> system.time(invisible( apply(M, 2, t.test)))
user  system elapsed 
113.513   0.663 113.519

score 9 · Accepted Answer

colttests您可以使用包中的功能genefilter（在 Bioconductor 上）做得比这更好。

> library(genefilter)
> M <- matrix(rnorm(40),nrow=20)
> my.t.test <- function(c){
+   n <- sqrt(length(c))
+   mean(c)*n/sd(c)
+ }
> x1 <- apply(M, 2, function(c) my.t.test(c))
> x2 <- colttests(M, gl(1, nrow(M)))[,"statistic"]
> all.equal(x1, x2)
[1] TRUE
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
   user  system elapsed 
 27.386   0.004  27.445 
> system.time(invisible(colttests(M, gl(1, nrow(M)))[,"statistic"]))
   user  system elapsed 
  0.412   0.000   0.414

参考：“在 R 中同时计算数千个测试统计数据”，SCGN，第 18 (1) 卷，2007 年，http ://stat-computing.org/newsletter/issues/scgn-18-1.pdf 。

score 8 · Accepted Answer

如果您有一台多核机器，则使用所有内核会有一些好处，例如使用mclapply.

> library(multicore)
> M <- matrix(rnorm(40),nrow=20)
> x1 <- apply(M, 2, t.test)
> x2 <- mclapply(1:dim(M)[2], function(i) t.test(M[,i]))
> all.equal(x1, x2)
[1] "Component 1: Component 9: 1 string mismatch" "Component 2: Component 9: 1 string mismatch"
# str(x1) and str(x2) show that the difference is immaterial

这个小例子表明事情按我们的计划进行。现在放大：

> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, t.test)))
   user  system elapsed 
101.346   0.626 101.859
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i]))))
  user  system elapsed 
55.049   2.527  43.668

这是使用 8 个虚拟内核。你的旅费可能会改变。不是很大的收获，但它来自很少的努力。

编辑

如果您只关心 t 统计量本身，那么提取相应的字段 ( $statistic) 会使事情变得更快，尤其是在多核情况下：

> system.time(invisible(apply(M, 2, function(c) t.test(c)$statistic)))
   user  system elapsed 
 80.920   0.437  82.109 
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i])$statistic)))
   user  system elapsed 
 21.246   1.367  24.107

甚至更快，直接计算t值

my.t.test <- function(c){
  n <- sqrt(length(c))
  mean(c)*n/sd(c)
}

然后

> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
   user  system elapsed 
 21.371   0.247  21.532 
> system.time(invisible(mclapply(1:dim(M)[2], function(i) my.t.test(M[,i]))))
   user  system elapsed 
144.161   8.658   6.313

r - 将 t.test 应用于大型矩阵的每一列的最快方法是什么？

2 回答 2

Related

Reference