12

I've been wondering about this question for quite a while but cannot find a reference: How does Matlab transpose a sparse matrix so fast, given that it is stored in CSC (compressed sparse column) format?

Also its documentation verifies the efficiency of sparse matrix transposition:

To do this (accessing row by row), you can transpose the matrix, perform operations on the columns, and then retranspose the result … The time required to transpose the matrix is negligible.

Follow-up (modified as suggested by @Mikhail):

I agree with @Roger and @Milhail that setting a flag is enough for many operations such as the BLAS or sparse BLAS operations in terms of their interfaces. But it appears to me that Matlab does "actual" transposition. For example, I have a sparse matrix X with size m*n=7984*12411, and I want to scale each column and each row:

% scaling each column
t = 0;
for i = 1 : 1000
    A = X; t0 = tic;
    A = bsxfun(@times, A, rand(1,n));
    t = t + toc(t0);
end

t = 0.023636 seconds

% scaling each row
t = 0;
for i = 1 : 1000
    A = X; t0 = tic;
    A = bsxfun(@times, A, rand(m,1));
    t = t + toc(t0);
end

t = 138.3586 seconds

% scaling each row by transposing X and transforming back
t = 0;
for i = 1 : 1000
    A = X; t0 = tic;
    A = A'; A = bsxfun(@times, A, rand(1,m)); A = A';
    t = t + toc(t0);
end

t = 19.5433 seconds

This result means that accessing column by column is faster than accessing row by row. It makes sense because sparse matrices are stored column by column. So the only reason for the fast speed of column scaling of X' should be that X is actually transposed to X' instead of setting a flag.

Also, if every sparse matrix is stored in CSC format, simply setting a flag cannot make X' in CSC format.

Any comments? Thanks in advance.

4

3 回答 3

9

经过一周的探索,我对转置矩阵的内部机制的猜测是排序。

假设A是一个稀疏矩阵,

[I, J, S] = find(A);
[sorted_I, idx] = sort(I);
J = J(idx);
S = S(idx);
B = sparse(J, sorted_I, S);

然后B是 的转置A

transpose上述实现的效率大约是我机器上内置的 Matlab 的一半。考虑到 Matlab 的内置函数是多线程的,我的猜测可能是合理的。

于 2013-06-04T22:27:15.207 回答
3

我意识到我玩游戏有点晚了,但我想我可以帮助阐明这个问题。转置稀疏矩阵实际上是一项简单的任务,可以在与输入矩阵中非零元素的数量成比例的时间内完成。假设A是一个以CSC格式存储的mxn矩阵,即A由三个数组定义:

  1. elemsA,长度为 nnz(A),将非零元素存储在 A
  2. prowA,长度为 nnz(A),存储 A 中非零元素的行索引
  3. pcolA,长度为 n + 1,使得 A 的第 j 列中的所有非零元素都由范围 [pcolA(j), pcolA(j + 1)) 索引

如果 B 表示 A 的转置,那么我们的目标是定义类似的数组 elemsB、prowB、pcolB。为此,我们使用 A 的行形成 B 的列这一事实。令 tmp 是一个数组,使得 tmp(1) = 0 并且 tmp(i + 1) 是 A 的第 i 行中的元素数我 = 1,...,米。那么由此得出 tmp(i + 1) 是 B 的第 i 列中的元素个数。因此,tmp 的累积和与 pcolB 相同。现在假设 tmp 已被其累积和覆盖。然后 elemsB 和 prowB 可以填充如下

    for j = 1,...,n
        for k = pcolA(j),...,pcolA(j + 1) - 1
            prowB(tmp(prowA(k))) = j
            elemsB(tmp(prowA(k))) = elemsA(k)
            tmp(prowA(k)) = tmp(prowA(k)) + 1
        end
    end

数组 tmp 用于在添加新元素时对 prowB 和 elemsB 进行索引,然后进行相应更新。总而言之,我们可以用 C++ 编写一个实现转置算法的 mex 文件:

#include "mex.h"
#include <vector>
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {     
    // check input output
    if (nrhs != 1)
       mexErrMsgTxt("One input argument required");
    if (nlhs > 1)
       mexErrMsgTxt("Too many output arguments");

    // get input sparse matrix A
    if (mxIsEmpty(prhs[0])) { // is A empty?
        plhs[0] = mxCreateSparse(0, 0, 0, mxREAL);
        return;
    }
    if (!mxIsSparse(prhs[0]) || mxIsComplex(prhs[0])) // is A real and sparse?
       mexErrMsgTxt("Input matrix must be real and sparse");
    double* A = mxGetPr(prhs[0]);           // real vector for A
    mwIndex* prowA = mxGetIr(prhs[0]);      // row indices for elements of A
    mwIndex* pcolindexA = mxGetJc(prhs[0]); // index into the columns
    mwSize M = mxGetM(prhs[0]);             // number of rows in A
    mwSize N = mxGetN(prhs[0]);             // number of columns in A

    // allocate memory for A^T
    plhs[0] = mxCreateSparse(N, M, pcolindexA[N], mxREAL);
    double* outAt = mxGetPr(plhs[0]);
    mwIndex* outprowAt = mxGetIr(plhs[0]);
    mwIndex* outpcolindexAt = mxGetJc(plhs[0]);

    // temp[j + 1] stores the number of nonzero elements in row j of A
    std::vector<mwSize> temp(M + 1, 0); 
    for(mwIndex i = 0; i != N; ++i) {
        for(mwIndex j = pcolindexA[i]; j < pcolindexA[i + 1]; ++j)
            ++temp[prowA[j] + 1];
    }
    outpcolindexAt[0] = 0;
    for(mwIndex i = 1; i <= M; ++i) {
        outpcolindexAt[i] = outpcolindexAt[i - 1] + temp[i];
        temp[i] = outpcolindexAt[i];
    }
    for(mwIndex i = 0; i != N; ++i) {
        for(mwIndex j = pcolindexA[i]; j < pcolindexA[i + 1]; ++j) {
            outprowAt[temp[prowA[j]]] = i;
            outAt[temp[prowA[j]]++] = A[j];
        }
    }
}

将此算法与 Matlab 的转置实现进行比较,我们观察到相似的执行时间。请注意,可以直接修改此算法以消除临时数组。

于 2015-08-24T19:46:21.167 回答
1

我同意 Roger Rowland 在评论中提到的内容。为了支持这个建议,您可以检查 BLAS 接口中的一些函数,MATLAB 使用该接口进行矩阵运算。我不确定它使用什么实现,但由于他们使用英特尔 IPP 进行图像处理,我想他们不妨使用英特尔 MKL 来快速进行矩阵运算。

这是mkl_?cscsv函数的文档,它为 CSC 格式的稀疏矩阵求解线性方程组。请注意transa输入标志,它明确定义是否应将提供的矩阵视为转置矩阵。

于 2013-05-23T19:24:13.610 回答