
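Dear all,

I computed a distance matrix between researchers on an undirected graph using igraph's shortest.paths, which produced an 80 GB matrix. The next step is to melt the upper triangle of the matrix. I need to build this kind of matrix for 16 different graphs.

When I try to run any operation on the matrix (data.table::melt, lower.triangle <- NA, or is.infinite <- NA), I get a "cannot allocate vector..." error, even when running on an Amazon AWS r3.8xlarge instance with 244 GiB of RAM.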
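
For scale, here is a rough back-of-the-envelope estimate of the memory involved. It is only a sketch under stated assumptions: that "80 GB" means 80e9 bytes of 8-byte doubles, and that upper.tri() builds full-size row and column index matrices before comparing them, as base R's implementation appears to do. All variable names are made up for illustration.

```r
# Rough memory estimate for an n x n distance matrix of doubles.
# Assumptions: "80 GB" = 80e9 bytes; numeric = 8 bytes per cell,
# integer and logical = 4 bytes per cell.
matrix_bytes <- 80e9
n_entries <- matrix_bytes / 8      # number of matrix cells
n_authors <- sqrt(n_entries)       # implied number of vertices

# Assumed behaviour of upper.tri(x): row(x) < col(x), i.e. two integer
# matrices plus one logical result, all the same shape as x.
upper_tri_temp_bytes <- 3 * 4 * n_entries

# is.infinite(x) returns one logical matrix of the same shape as x.
is_infinite_temp_bytes <- 4 * n_entries

to_gb <- function(bytes) bytes / 1e9
cat(sprintf("cells: %.0f, implied authors: %.0f\n", n_entries, n_authors))
cat(sprintf("upper.tri temporaries: %.0f GB on top of the %.0f GB matrix\n",
            to_gb(upper_tri_temp_bytes), to_gb(matrix_bytes)))
cat(sprintf("is.infinite temporary: %.0f GB\n", to_gb(is_infinite_temp_bytes)))
```

Under these assumptions, the temporaries for a single upper.tri() call would be larger than the matrix itself, before any copy made by the subsequent assignment is counted.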

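Strategies I have tested:

  • Splitting the matrix and melting the sub-matrices: too time-consuming (2 days per matrix), or too memory-hungry when run in parallel

  • Converting to a big.matrix: too time-consuming (1 day per matrix)

Any ideas why I get the error even though I have 2-3 times as much memory available as the matrix occupies, and which strategy to use for data sets of this size?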

Thank you very much in advance!

Session info:

ami-b1b0c3c2 (from http://www.louisaslett.com/RStudio_AMI/)

RStudio 0.99.903

R 3.3.1

r3.8xlarge instance

My code (df_Network has 2 columns: AuthorID and ArticleID):

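  library(Matrix)      # spMatrix, tcrossprod
  library(igraph)      # graph.adjacency, shortest.paths
  library(data.table)  # data.table, rbindlist, melt
  library(foreach)     # foreach, %do%

  # Calculate dist.matrix: build a sparse author x article incidence matrix

  A <- spMatrix(nrow=length(unique(df_Network$AuthorID)),
                ncol=length(unique(df_Network$ArticleID)),
                i = as.numeric(factor(df_Network$AuthorID)),
                j = as.numeric(factor(df_Network$ArticleID)),
                x = rep(1, length(as.numeric(df_Network$AuthorID))) )
  row.names(A) <- levels(factor(df_Network$AuthorID))
  colnames(A) <- levels(factor(df_Network$ArticleID))
  # Author x author co-authorship counts
  Arow <- tcrossprod(A)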

  # Calculate weighted relations
  g1_weighted <- graph.adjacency(Arow, weighted = TRUE)

  # Simplify graph to remove self loops and multiples

  E(g1_weighted)$weight <- count.multiple(g1_weighted)
  g1_weighted <- simplify(g1_weighted)

  distMatrix_weighted <- shortest.paths(g1_weighted, v=V(g1_weighted), to=V(g1_weighted))

  # 1st option: melt Matrix (R crashes with each of the lines)

   distMatrix_weighted[upper.tri(distMatrix_weighted)] <- NA
   distMatrix_weighted[is.infinite(distMatrix_weighted)] <- NA
   A_weighted <- data.table::melt(distMatrix_weighted, na.rm = T)

  # 2nd option: split the matrix, take only the upper triangle, and build the data.table manually in parallel (takes 2 days to run because memory limits how many pieces I can process in parallel; too expensive to run on AWS as I am just a student)

  AllData <- foreach(j = 2:(ncol(distMatrix_weighted)-1), .combine = function(x,y) rbindlist(list(x,y)), .inorder = FALSE) %do% {
    # 1:(j-1) is parenthesised on purpose: 1:j-1 would evaluate to 0:(j-1)
    B <- data.table(Var1 = colnames(distMatrix_weighted)[j],
                    Var2 = rownames(distMatrix_weighted)[1:(j-1)],
                    value = distMatrix_weighted[1:(j-1), j])
    B[!is.infinite(value) & value > 0]
  }

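  # 3rd option: split the matrix and run data.table::melt in parallel (takes too long because memory limits how many pieces I can process in parallel)

  # Block boundaries; the final entry closes the last (possibly shorter) block
  sequence <- c(seq(1, nrow(distMatrix_weighted), 100), nrow(distMatrix_weighted) + 1)

  AllData <- foreach(j = 1:(length(sequence)-1), .combine = function(x,y) rbindlist(list(x,y)), .inorder = FALSE) %do% {
    # Rows sequence[j] .. sequence[j+1]-1, so consecutive blocks do not overlap
    m <- distMatrix_weighted[sequence[j]:(sequence[j+1]-1), , drop = FALSE]
    m[!is.finite(m)] <- NA
    data.table::melt(m, na.rm = TRUE)
  }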
  # 4th option: use the bigmemory package to store the matrix and then process several pieces in parallel. The conversion to big.matrix has been running for a day with no output

  library(bigmemory)
  distMatrix_weighted <- as.big.matrix(distMatrix_weighted)
  ...
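
The column-wise idea from the 2nd option can also be written so that no full-size logical or index matrix is ever created. The following is a minimal base R sketch on a toy 6 x 6 matrix, not a tested solution for the 80 GB case. The names d, melt_upper_block and blocks are hypothetical, and with data.table the do.call(rbind, ...) calls would presumably be replaced by rbindlist().

```r
# Toy stand-in for the real distance matrix (hypothetical 6 x 6 example).
set.seed(1)
n <- 6
d <- matrix(sample(c(1, 2, 3, Inf), n * n, replace = TRUE), n, n,
            dimnames = list(paste0("a", 1:n), paste0("a", 1:n)))
d[lower.tri(d)] <- t(d)[lower.tri(d)]  # symmetric, as for an undirected graph
diag(d) <- 0

# Melt the strict upper triangle for a block of columns. Only one column
# slice of length j - 1 is held at a time; no n x n temporary is built.
melt_upper_block <- function(m, cols) {
  cols <- cols[cols > 1]               # column 1 has nothing above the diagonal
  pieces <- lapply(cols, function(j) {
    rows <- seq_len(j - 1)
    vals <- unname(m[rows, j])
    keep <- is.finite(vals) & vals > 0 # drop unreachable pairs and zeros
    data.frame(Var1 = rownames(m)[rows][keep],
               Var2 = rep(colnames(m)[j], sum(keep)),
               value = vals[keep],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# Process the columns in blocks of width 2 and stack the partial results.
blocks <- split(seq_len(n), ceiling(seq_len(n) / 2))
result <- do.call(rbind, lapply(blocks, function(b) melt_upper_block(d, b)))
rownames(result) <- NULL
print(result)
```

Each block is independent of the others, so the blocks could be written to disk one at a time instead of being combined in memory. The sketch still assumes the dense matrix already exists; it only avoids the extra full-size temporaries.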

Link to sample data: https://www.dropbox.com/s/eaud2np33e5y6iv/df_network_2000-2005.RData?dl=0
