r - as.matrix 在距离对象上非常慢；如何让它更快？

Question

我找到了一个使用多线程计算距离矩阵的 R 包Rlof，它做得很好。

但是，函数的输出distmc是向量而不是矩阵。应用于as.matrix这个“dist”对象比多线程计算距离要昂贵得多。

查看功能帮助，打印对角线和上三角形的选项在那里，但我不明白它们应该在哪里使用。

有没有办法以某种方式节省时间as.matrix？

可重现的例子：

set.seed(42)
M1 = matrix(rnorm(15000*20), nrow = 15000, ncol =20)
system.time({dA = distmc(M1, method = "euclidean", diag = TRUE,
                         upper = TRUE, p = 2)})
system.time(A = as.matrix(dA))

score 6 · Accepted Answer

返回什么`dist`？

此函数始终返回一个向量，其中包含完整矩阵的下三角部分（按列）。diagor参数只影响打印，upper即stats:::print.dist. 这个函数可以打印这个向量，就好像它是一个矩阵一样；但事实并非如此。

为什么`as.matrix`对“dist”对象不利？

很难有效地使用三角矩阵或进一步使它们在 R 核心中对称。如果您的矩阵很大lower.tri，可能会占用内存： R：将矩阵的上三角部分转换为对称矩阵。upper.tri

将“dist”对象转换为矩阵更糟糕。看代码stats:::as.matrix.dist：

function (x, ...) 
{
    size <- attr(x, "Size")
    df <- matrix(0, size, size)
    df[row(df) > col(df)] <- x
    df <- df + t(df)
    labels <- attr(x, "Labels")
    dimnames(df) <- if (is.null(labels)) 
    list(seq_len(size), seq_len(size))
    else list(labels, labels)
    df
}

和的使用row是一场噩梦。最后生成另一个大的临时矩阵对象。当内存成为瓶颈时，一切都很慢。colt"dimnames<-"

但是我们仍然可能需要一个完整的矩阵，因为它很容易使用。

尴尬的是，使用完整矩阵更容易，所以我们想要它。考虑这个例子：R - 如何从距离矩阵中获取匹配元素的行和列下标。如果我们试图避免形成完整的矩阵，则操作很棘手。

Rcpp解决方案

我编写了一个 Rcpp 函数dist2mat（请参阅此答案末尾的“dist2mat.cpp”源文件）。

该函数有两个输入：一个“dist”对象x和一个（整数）缓存阻塞因子bf。该函数首先创建一个矩阵并填充其下三角形部分，然后将下三角形部分复制到上三角形以使其对称。第二步是典型的转置操作，对于大型矩阵缓存阻塞是值得的。性能应该对缓存阻塞因子不敏感，只要它不太小或太大。128 或 256 通常是一个不错的选择。

这是我第一次尝试使用 Rcpp。我一直是使用 R 的传统 C 接口的 C 程序员。但我也在努力熟悉 Rcpp。鉴于您不知道如何编写编译代码，您可能也不知道如何运行 Rcpp 函数。你需要

安装包（如果您在 Windows 上，Rcpp不确定是否还需要Rtools ）；
将我的“dist2mat.cpp”复制到 R 当前工作目录下的文件中（您可以从getwd()R 会话中获取它）。“.cpp”文件只是一个纯文本文件，因此您可以使用任何文本编辑器创建、编辑和保存它。

现在让我们开始展示。

library(Rcpp)
sourceCpp("dist2mat.cpp")  ## this takes some time; be patient

## a simple test with `dist2mat`
set.seed(0)
x <- dist(matrix(runif(10), nrow = 5, dimnames = list(letters[1:5], NULL)))
A <- dist2mat(x, 128)  ## cache blocking factor = 128
A
#          a         b         c         d         e
#a 0.0000000 0.9401067 0.9095143 0.5618382 0.4275871
#b 0.9401067 0.0000000 0.1162289 0.3884722 0.6968296
#c 0.9095143 0.1162289 0.0000000 0.3476762 0.6220650
#d 0.5618382 0.3884722 0.3476762 0.0000000 0.3368478
#e 0.4275871 0.6968296 0.6220650 0.3368478 0.0000000

生成的矩阵保留传递给的原始矩阵/数据帧的行名dist。

您可以调整机器上的缓存阻塞因子。请注意，缓存阻塞的影响对于小矩阵并不明显。在这里，我尝试了一个 10000 x 10000 的。

## mimic a "dist" object without actually calling `dist`
n <- 10000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

system.time(A <- dist2mat(x, 64))
#   user  system elapsed 
#  0.676   0.424   1.113 

system.time(A <- dist2mat(x, 128))
#   user  system elapsed 
#  0.532   0.140   0.672 

system.time(A <- dist2mat(x, 256))
#   user  system elapsed 
#  0.616   0.140   0.759

我们可以dist2mat用as.matrix. 由于as.matrix是 RAM 消耗，我在这里使用一个小例子。

## mimic a "dist" object without actually calling `dist`
n <- 2000
x <- structure(numeric(n * (n - 1) / 2), class = "dist", Size = n)

library(bench)
bench::mark(dist2mat(x, 128), as.matrix(x), check = FALSE)
## A tibble: 2 x 14
#  expression         min   mean  median     max `itr/sec` mem_alloc  n_gc n_itr
#  <chr>         <bch:tm> <bch:> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
#1 dist2mat(x, …   24.6ms   26ms  25.8ms  37.1ms     38.4     30.5MB     0    20
#2 as.matrix(x)   154.5ms  155ms 154.8ms 154.9ms      6.46   160.3MB     0     4
## ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
##   time <list>, gc <list>

请注意如何dist2mat更快（参见“mean”、“median”），以及它需要的 RAM 有多少（参见“mem_alloc”）。我已设置check = FALSE禁用结果检查，因为dist2mat不返回“dimnames”属性（“dist”对象没有此类信息）但as.matrix确实返回（它设置1:2000为“dimnames”），因此它们不完全相等。但是您可以验证它们是否都是正确的。

A <- dist2mat(x, 128)
B <- as.matrix(x)
range(A - B)
#[1] 0 0

“dist2mat.cpp”

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix dist2mat(NumericVector& x, int bf) {

  /* input validation */
  if (!x.inherits("dist")) stop("Input must be a 'dist' object");
  int n = x.attr("Size");
  if (n > 65536) stop("R cannot create a square matrix larger than 65536 x 65536");

  /* initialization */
  NumericMatrix A(n, n);

  /* use pointers */
  size_t j, i, jj, ni, nj; double *ptr_x = &x[0];
  double *A_jj, *A_ij, *A_ji, *col, *row, *end;

  /* fill in lower triangular part */
  for (j = 0; j < n; j++) {
    col = &A(j + 1, j); end = &A(n, j);
    while (col < end) *col++ = *ptr_x++;
    }

  /* cache blocking factor */
  size_t b = (size_t)bf;

  /* copy lower triangular to upper triangular; cache blocking applied */
  for (j = 0; j < n; j += b) {
    nj = n - j; if (nj > b) nj = b;
    /* diagonal block has size nj x nj */
    A_jj = &A(j, j);
    for (jj = nj - 1; jj > 0; jj--, A_jj += n + 1) {
      /* copy a column segment to a row segment */
      col = A_jj + 1; row = A_jj + n;
      for (end = col + jj; col < end; col++, row += n) *row = *col;
      }
    /* off-diagonal blocks */
    for (i = j + nj; i < n; i += b) {
      ni = n - i; if (ni > b) ni = b;
      /* off-diagonal block has size ni x nj */
      A_ij = &A(i, j); A_ji = &A(j, i);
      for (jj = 0; jj < nj; jj++) {
        /* copy a column segment to a row segment */
        col = A_ij + jj * n; row = A_ji + jj;
        for (end = col + ni; col < end; col++, row += n) *row = *col;
        }
      }
    }

  /* add row names and column names */
  A.attr("dimnames") = List::create(x.attr("Labels"), x.attr("Labels"));

  return A;
  }

r - as.matrix 在距离对象上非常慢；如何让它更快？

1 回答 1

返回什么dist？

为什么as.matrix对“dist”对象不利？

但是我们仍然可能需要一个完整的矩阵，因为它很容易使用。

Rcpp解决方案

“dist2mat.cpp”

Related

Reference

返回什么`dist`？

为什么`as.matrix`对“dist”对象不利？