r - Best way to manage a large number of combinations in R

Question

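I'm hoping to get some advice on managing a large number of combinations in R.

I'm a plant breeding graduate student trying to find the highest mean offspring value across various combinations of 40 parents in a plant population. I start by creating a matrix containing the values obtained by crossing these parents: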

# fake data
B <- matrix (data=runif(1600, 1.0, 5.0),ncol=40,nrow=40)
diag(B) <- diag(B) - 1 # diagonals are when plants are crossed to themselves and suffer inbreeding depression

I do this by taking the mean of the subset of the matrix ("perse.hybrid") that contains a given combination of parents:

SubsetWright <- function (perse.hybrid, subset) {
  return (mean(perse.hybrid[subset,subset]))
}
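To make the averaging concrete, here is a small sketch on a made-up 3 x 3 matrix (the values are illustrative only, not real cross data). It shows that the mean is taken over the full square block for the chosen parents, so it includes the diagonal (selfed) entries as well as both reciprocal crosses.

```r
SubsetWright <- function (perse.hybrid, subset) {
  return (mean(perse.hybrid[subset, subset]))
}

# Illustrative matrix only: entry [i, j] stands for the cross of parent i with parent j
toy <- matrix(1:9, nrow = 3, ncol = 3)

# Parents 1 and 3 select the block toy[c(1, 3), c(1, 3)], i.e. the values 1, 3, 7, 9
block <- toy[c(1, 3), c(1, 3)]
value <- SubsetWright(toy, c(1, 3))
print(block)
print(value)
```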

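Ideally I would like to find the offspring values for all combinations of the 40 parents, but more realistically I'd like to find the values for combinations of 2 to 11 parents. That is about 3.5 billion combinations.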
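The 3.5 billion figure can be checked directly with choose(), which also shows how quickly the count grows with the number of parents per combination:

```r
n <- 40
sizes <- 2:11

# Number of distinct parent subsets of each size
per_size <- choose(n, sizes)
names(per_size) <- sizes
print(per_size)

# Total number of combinations across sizes 2 to 11 (about 3.53e9)
total <- sum(per_size)
print(format(total, big.mark = ","))
```

Combinations of 11 parents alone account for roughly two thirds of the total.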

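I've been working on speeding this up and managing memory. For speed, I set it up to run tasks in parallel on an Amazon EC2 cluster (usually 3 m4.10xlarge machines). To deal with the memory challenge, I tried keeping the data in a big.matrix. However, I seem to be running up against combn. It usually crashes when I get to 40 choose 8. Looking at htop, I believe that is because of memory usage.

I'm a novice R user and don't fully understand how memory is managed in R. It seems I could get around these limits if I could somehow split up the combn function, which might let me run it in parallel and avoid the memory limits. Or perhaps there is a way to fill a big.matrix with all the combinations without using combn. Does anyone have a suggested strategy? The code is below. Thanks very much!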
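A rough back-of-the-envelope estimate supports the memory suspicion. The sketch below assumes the combination indices are stored as 4-byte integers and that the big.matrix uses 8-byte doubles (the usual bigmemory default, but treat that as an assumption); it ignores R's per-object overhead and any temporary copies.

```r
n <- 40
bytes_int <- 4      # assumed size of one stored combination index
bytes_dbl <- 8      # assumed size of one big.matrix cell
gib <- 1024^3

# Size of the in-memory matrix holding every combination of size k
combo_gib <- function(k) choose(n, k) * k * bytes_int / gib

sizes <- data.frame(k = 2:11,
                    rows = choose(n, 2:11),
                    combo_GiB = round(sapply(2:11, combo_gib), 2))
print(sizes)

# Full result: one row per combination, max + 1 = 12 double columns
full_gib <- sum(choose(n, 2:11)) * (11 + 1) * bytes_dbl / gib
print(round(full_gib))
```

Under these assumptions the k = 8 combination matrix is about 2.3 GiB before any copy is made (transposing it creates a second one), and the full 2 to 11 result would need roughly 316 GiB, so the sizes are worth comparing against the RAM actually available on each machine.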

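library(parallel)   # clusterExport(), parApply()
library(bigmemory)  # big.matrix(), mpermute()
library(gRbase)     # combnPrim() (assumed to come from gRbase)

#' Test all combinations of parents to find set of offspring with highest estimated mean.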
#'
#' @param perse.hybrid  A matrix of offspring values, with row[i]=col[j]=parent ID 
#' @param min The minimum number of parents to test combinations of
#' @param max The maximum number of parents to test combinations of
#' @param rows Number of rows of top combinations to return, default is to return all rows
#' @param cl cluster to use
#' @return A big.matrix with parent combinations and predicted average offspring values 
TestSyn <- function (perse.hybrid, min, max, rows="all", cl) {

      clusterExport(cl, list("SubsetWright"))

      total <- sum(apply(X=array(min:max),MARGIN=1,FUN=choose,n=nrow(perse.hybrid)))
      n <- nrow(perse.hybrid)
      start <- 1
      stop <- choose(n,min)

      syn.data <- big.matrix(nrow=total,ncol=max+1)

      for (i in min:max)
      {

        # add inbred numbers to syn.data. This seems to be what crashes when i gets large (>7)
        inbreds <- t(combnPrim(1:n,i))
        syn.data[start:stop,1:i] <- inbreds

        # add synthetic values to syn.data
        syn.data[start:stop,max+1]  <- parApply(cl=cl,X=inbreds,MARGIN=1,FUN=SubsetWright,perse.hybrid=perse.hybrid)

        # move the row window to the block for the next combination size
        start <- stop + 1
        stop <- start + choose(n,i+1) - 1

      }

      # sort by offspring average
      mpermute(x=syn.data,cols=max+1,decreasing=TRUE)

      if (rows == "all") rows <- nrow(syn.data)

      return (syn.data[1:rows,])
    }

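Edit:

After more investigation, it looks like combinadics may be part of the solution. Stack Overflow post here: Generating a very large matrix of string combinations using combn() and bigmemory package

I'll do some tests to see whether this works for large numbers.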
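As a starting point for those tests, here is a minimal sketch of the idea in base R: compute the combination at a given position directly, so a range of positions can be scored without ever materialising the full combn output. The helper names unrank_combination and best_in_range are hypothetical, not taken from the linked post, and the ordering is assumed to match combn (lexicographic).

```r
# Return the combination at 1-based position `rank` among all
# k-subsets of 1:n in lexicographic order.
unrank_combination <- function(rank, n, k) {
  stopifnot(k >= 1, k <= n, rank >= 1, rank <= choose(n, k))
  combo <- integer(k)
  r <- rank - 1   # zero-based rank still to be consumed
  x <- 1L         # smallest candidate for the current position
  for (i in seq_len(k)) {
    repeat {
      # combinations that have x in position i, given the earlier choices
      block <- choose(n - x, k - i)
      if (r < block) break
      r <- r - block
      x <- x + 1L
    }
    combo[i] <- x
    x <- x + 1L
  }
  combo
}

# Score positions first..last for subsets of size k, keeping only the best one.
best_in_range <- function(perse.hybrid, k, first, last) {
  n <- nrow(perse.hybrid)
  best <- list(value = -Inf, parents = NULL)
  rank <- first
  while (rank <= last) {
    parents <- unrank_combination(rank, n, k)
    value <- mean(perse.hybrid[parents, parents])
    if (value > best$value) best <- list(value = value, parents = parents)
    rank <- rank + 1
  }
  best
}

# Small check against combn on fake data
set.seed(1)
B <- matrix(runif(36, 1.0, 5.0), ncol = 6, nrow = 6)
reference <- t(combn(6, 3))
first_half <- best_in_range(B, 3, 1, 10)
second_half <- best_in_range(B, 3, 11, 20)
print(max(first_half$value, second_half$value))
```

Each range is independent of the others, so ranges could be handed to separate workers (for example with parallel::parLapply) and only the best few rows per range returned. Unranking every position is slow; a faster variant would unrank only the first position of each range and step to the next combination from there.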
