r - Creating a distance matrix in R using parallelization

Question

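I have N vectors containing the cumulative frequencies of tweets. For clarification, one of these vectors looks like (0, 0, 1, 1, 2, 3, 4, 4, 5, 5, 6, 6, ...)

I want to visualize the differences between these frequencies by creating a heatmap. For that, I first want to create an NxN matrix containing the Euclidean distances between the tweets. My first approach is very Java-like and looks like this: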

create_dist <- function(x){
  n <- length(x)                             #number of tweets
  xy <- matrix(nrow=n, ncol=n)               #create NxN matrix
  colnames(xy) <- names(x)                   #set column
  rownames(xy) <- names(x)                   #and row names

  for(i in 1:n) {
    for(j in 1:n){
      xy[i,j] <- distance(x[[i]], x[[j]])    #calculate euclidean distance for now, but should be interchangeable
    }
  }

  xy
}

I measured the time needed to create this distance matrix, and for a small sample (around two thousand tweets) it already takes about 35 seconds.

> system.time(create_dist(cumFreqs))
user  system elapsed 
34.572   0.000  34.602

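Now I am thinking about how to speed up the calculation a bit. Since my computer has 8 cores, I thought it might be faster if I used parallelization.

Being an R novice, I changed the inner for loop to a foreach loop.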

#libraries
library(foreach)
library(doMC)
registerDoMC(4)

create_dist <- function(x){
  n <- length(x)                                #number of tweets
  xy <- matrix(nrow=n, ncol=n)                  #create NxN matrix
  colnames(xy) <- names(x)                      #set column
  rownames(xy) <- names(x)                      #and row names

  for(i in 1:n) {
    xy[i,] <- unlist(foreach(j=1:n) %dopar% {   #set each row of the matrix
      distance(x[[i]], x[[j]])
    })
  }

  xy
}

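I wanted to use system.time() again to measure how long it takes to create the distance matrix for the sample of two thousand tweets, but I cancelled the execution after 10 minutes because there was obviously no speedup at all.

I searched for a solution but unfortunately did not find one. So now I would like to ask whether there is a better way to create this distance matrix, perhaps with an apply function, which I am not ashamed to admit still confuses me.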

score 2 · Accepted Answer

As already mentioned, you can use the dist function. Here is an example of how to use the result of dist to create a heatmap.

nn <- paste0('row',1:5)
x <- matrix(rnorm(25), nrow = 5,dimnames=list(nn))
distObj <- dist(x)
cols <- c("#D33F6A", "#D95260", "#DE6355", "#E27449", 
            "#E6833D", "#E89331", "#E9A229", "#EAB12A", "#E9C037", 
            "#E7CE4C", "#E4DC68", "#E2E6BD")
## mandatory coercion
distObj <- as.matrix(distObj)
## heatmap
image(distObj[order(nn), order(nn)], col = cols, 
      xaxt = "n", yaxt = "n")
## axes labels
axis(1, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)
axis(2, at = seq(0, 1, length.out = dim(distObj)[1]), labels = nn, 
     las = 2)


score 0 · Accepted Answer

As 'agstudy' suggested, use the built-in 'dist' function.
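To connect this to the data in the question, here is a minimal sketch of how a list of vectors could be passed to dist. It assumes all vectors have the same length; the cumFreqs contents below are made-up stand-ins, not the asker's data.

```r
# Hypothetical stand-in for the question's data: a named list of
# equal-length cumulative-frequency vectors, one per tweet.
set.seed(1)
cumFreqs <- replicate(5, cumsum(rpois(12, lambda = 0.5)), simplify = FALSE)
names(cumFreqs) <- paste0("tweet", seq_along(cumFreqs))

# Stack the vectors into a matrix with one row per tweet.
m <- do.call(rbind, cumFreqs)

# dist() computes pairwise distances between rows in compiled code.
# The `method` argument keeps the metric interchangeable
# (for example "euclidean", "manhattan", "maximum").
d <- as.matrix(dist(m, method = "euclidean"))
```

The result d is a full NxN matrix whose row and column names come from names(cumFreqs), so it can be passed straight to image() or heatmap().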

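For future reference, nested for loops in R are very slow. Since R is a functional language, try to use vectorized operations together with functions from the apply family (apply, lapply, sapply, tapply). When you are used to a C-like paradigm, it takes some time to get used to writing tasks in a functional way.

A useful discussion of benchmarks comparing for loops and the apply style is here: Is R's apply family more than syntactic sugar?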
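As a rough illustration of the apply style, the nested loops from the question could be written with vapply as below. The names euclid and pairwise_dist are hypothetical, and euclid only stands in for the question's distance function, which is not shown. This mainly helps readability and keeps the distance function interchangeable; it still makes one R-level call per cell, so dist should remain much faster for the built-in metrics.

```r
# Hypothetical distance function; any function of two numeric vectors works.
euclid <- function(a, b) sqrt(sum((a - b)^2))

pairwise_dist <- function(x, dist_fun = euclid) {
  n <- length(x)
  # vapply is a type-checked sapply: the inner call returns one number per j,
  # the outer call collects one length-n vector per i (as columns).
  xy <- vapply(seq_len(n), function(i) {
    vapply(seq_len(n), function(j) dist_fun(x[[i]], x[[j]]), numeric(1))
  }, numeric(n))
  # Transpose so that xy[i, j] == dist_fun(x[[i]], x[[j]]).
  xy <- t(xy)
  dimnames(xy) <- list(names(x), names(x))
  xy
}

x <- list(a = c(0, 0, 1, 1), b = c(0, 1, 2, 3), c = c(1, 1, 1, 1))
res <- pairwise_dist(x)
```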
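On the parallel attempt in the question: a likely reason it was so slow is that %dopar% was asked to dispatch one tiny task per matrix cell, so scheduling overhead dominates the actual work. If a custom distance function really has to be parallelized, a sketch along these lines hands out one task per row instead. It uses mclapply from the parallel package; parallel_dist and euclid are hypothetical names, and the core count is an arbitrary choice.

```r
library(parallel)

# Hypothetical distance function standing in for the question's distance().
euclid <- function(a, b) sqrt(sum((a - b)^2))

parallel_dist <- function(x, dist_fun = euclid, cores = 2L) {
  n <- length(x)
  # mclapply relies on forking, which is unavailable on Windows,
  # so fall back to a single core there.
  if (.Platform$OS.type == "windows") cores <- 1L
  # One task per row (n tasks) rather than one per cell (n^2 tasks).
  rows <- mclapply(seq_len(n), function(i) {
    vapply(seq_len(n), function(j) dist_fun(x[[i]], x[[j]]), numeric(1))
  }, mc.cores = cores)
  xy <- do.call(rbind, rows)
  dimnames(xy) <- list(names(x), names(x))
  xy
}

x <- list(a = c(0, 0, 1, 1), b = c(0, 1, 2, 3), c = c(1, 1, 1, 1))
pd <- parallel_dist(x)
```

Whether this beats the sequential version depends on how expensive the distance function is; for plain Euclidean distance, dist will most likely still win.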
