matlab - SVD speed on CPU and GPU

Question

I'm testing svd in Matlab R2014a, and there seems to be no speedup on the GPU compared with the CPU. I'm using a GTX 460 card and a Core 2 Duo E8500.

Here is my code:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

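Also, the run times vary from run to run, but the CPU and GPU versions take roughly the same time. Why is there no speedup?

Here are some tests: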

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.

score 9 · Accepted Answer

Generally, SVD is a difficult routine to parallelize. You can check here that even with a high-end Tesla card, the speedup is not very impressive.

You have a GTX460 card - Fermi architecture. The card is optimized for gaming (single precision computations), not HPC (double precision computation). The Single Precision / Double Precision throughput ratio is 12. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.

So if the Md array uses double precision elements, then the computation on it would be rather slow. Also there's a high chance that when calling the CPU routine, all CPU cores will get utilized, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay time for transferring the buffer to the device.

Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can also try a bigger dataset to see if something changes. I don't expect too much gain for this routine on your GPU.
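A minimal sketch of such a benchmark follows. It assumes the Parallel Computing Toolbox is installed, uses timeit and gputimeit instead of tic/toc, and picks matrix sizes that are illustrative rather than taken from the question. The data is copied to the device before timing, so the transfer cost is deliberately excluded.

```matlab
% Sketch: time svd (singular values only) on the CPU and, if present, the GPU.
% Assumes the Parallel Computing Toolbox; the sizes are illustrative only.
m = 2000;
n = 200;
Mh  = rand(m, n);        % double-precision matrix in host memory
MhS = single(Mh);        % single-precision copy of the same data

% timeit runs the function several times and returns a typical time.
tCpuDouble = timeit(@() svd(Mh));
tCpuSingle = timeit(@() svd(MhS));
fprintf('CPU: double %.4f s, single %.4f s\n', tCpuDouble, tCpuSingle);

if gpuDeviceCount > 0
    Md  = gpuArray(Mh);  % copy to the device once, outside the timed region
    MdS = gpuArray(MhS);
    % gputimeit waits for the GPU to finish before the clock is stopped.
    tGpuDouble = gputimeit(@() svd(Md));
    tGpuSingle = gputimeit(@() svd(MdS));
    fprintf('GPU: double %.4f s, single %.4f s\n', tGpuDouble, tGpuSingle);
end
```

By default both timing functions call svd with one output, so only the singular values are computed. Passing 3 as a second argument, for example timeit(@() svd(Mh), 3), should time the full decomposition instead.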

Update 1:

After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because an SSE register holds half as many double values as single values. However, a ratio of only 2 on the GPU side means that the GPU code does not make the best use of the SM cores, because the theoretical ratio is 12. In other words, I would have expected much better SP performance than DP performance from optimized code. It seems that this is not the case.
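The two ratios can be checked against the numbers already quoted in this thread. The throughput figures come from the answer above, and the timings are the first device (GPU) measurements from each of the two test runs in the question:

```latex
\[
\frac{\text{SP peak}}{\text{DP peak}} = \frac{873\ \text{GFLOPS}}{72\ \text{GFLOPS}} \approx 12.1,
\qquad
\frac{t_{\text{GPU, double}}}{t_{\text{GPU, single}}} = \frac{43.842\ \text{s}}{21.334\ \text{s}} \approx 2.06 .
\]
```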

score 5 · Accepted Answer

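As VAndrei already said, the SVD is an algorithm that is difficult to parallelize.

Your main problem is the size of your matrix. The performance of the SVD drops rapidly as the matrix grows, so your main goal should be to reduce the size of the matrix. This can be done with the Gaussian normal equations (which is basically a reduction of an overdetermined linear system in the least-squares sense).

This is done by simply multiplying the matrix by its transpose from the left: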

MhReduced = Mh' * Mh;

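This reduces your matrix to size cols*cols (where cols is the number of columns of Mh). Then you just call [U,S,V] = svd(MhReduced);

Note: this method may yield singular vectors of opposite sign (which matters if you are comparing the two methods).

If your matrix is well conditioned, this should work without problems. For an ill-conditioned matrix, however, this method may fail to produce a usable result, while applying the SVD directly could still do so, thanks to the robustness of the SVD.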
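The relationship between the two decompositions can be checked on a small random matrix. The sketch below is only an illustration with made-up sizes. It relies on two standard facts: the singular values of A'*A are the squares of those of A, and forming A'*A squares the 2-norm condition number, which is why the reduction is risky for ill-conditioned input.

```matlab
% Sketch: relate svd(A'*A) to svd(A) on a small random matrix.
rng(0);                          % fixed seed so the run is repeatable
A = rand(500, 20);               % illustrative size, not from the question

[~, S,  V ] = svd(A, 'econ');    % economy-size SVD of A itself
[~, Sr, Vr] = svd(A' * A);       % SVD of the reduced 20-by-20 matrix

% The singular values of A'*A are the squares of those of A.
errSigma = norm(sqrt(diag(Sr)) - diag(S)) / norm(diag(S));

% The right singular vectors agree up to a sign per column.
errV = norm(abs(Vr) - abs(V), 'fro');

% Forming A'*A squares the 2-norm condition number.
condA  = cond(A);
condAA = cond(A' * A);

fprintf('relative error in singular values: %g\n', errSigma);
fprintf('difference in |V|: %g\n', errV);
fprintf('cond(A)^2 = %g, cond(A''*A) = %g\n', condA^2, condAA);
```

So if the singular values of the original matrix are needed, take the square root of the diagonal of S returned for the reduced matrix.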

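This should improve your performance drastically, at least for matrices that are large enough. Another advantage is that you can work with much larger matrices. You may not have to use the GPU at all (either because the matrix is so large that copying it to the GPU costs too much, or because after the reduction the matrix is so small that the GPU speedup will not be large enough).

Also note that you lose a lot of performance if you request return values. If you are only interested in the performance of the SVD computation, do not request any return values. If you are only interested in the "solution vector", just request V (and access its last column): [~,~, V] = svd(Mh);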

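Edit:

I looked at your sample code, but I am not sure what it is that you are computing. I also realized that it is rather hard to understand what I did with A'*A, so I will explain in detail.

Given a linear system A*x=b, where A is the coefficient matrix with m rows and n columns, x is the solution vector (n rows) and b is the constant vector (m rows), the solution can be computed as follows:

If A is square (m=n): x = A^-1 * b,
If A is not square (m!=n, m > n):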

A * x = b

A'* A * x = A' * b

x = (A' * A)^-1 * A'*b

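A" = (A'*A)^-1 * A' is usually called the pseudo-inverse. However, this calculation affects the condition number negatively, because forming A'*A squares the condition number of A. A way around this problem is the singular value decomposition (SVD). If [U,S,V] = svd(A) denotes the result of the SVD, the pseudo-inverse is given by V*S"*U', where S" is formed by taking the inverse of the non-zero elements of S. So A" = V*S"*U'.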
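The equivalence of the two expressions for the pseudo-inverse can be written out as a short derivation. It assumes that A has full column rank and uses the economy-size SVD, so that U has orthonormal columns and S is an invertible n-by-n diagonal matrix; A^{+} below is what the text writes as A".

```latex
\begin{align*}
A &= U S V^{\mathsf T}, \qquad U^{\mathsf T} U = I_n, \quad V^{\mathsf T} V = V V^{\mathsf T} = I_n \\
A^{\mathsf T} A &= V S\, U^{\mathsf T} U\, S V^{\mathsf T} = V S^{2} V^{\mathsf T} \\
(A^{\mathsf T} A)^{-1} A^{\mathsf T} &= V S^{-2} V^{\mathsf T}\, V S\, U^{\mathsf T} = V S^{-1} U^{\mathsf T} = A^{+}
\end{align*}
```

The middle line also shows where the loss of accuracy comes from: the normal equations work with S^2 instead of S.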

x = A"*b

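However, an SVD is rather costly, especially for large matrices. If the matrix A is well conditioned and very precise results are not required (we are talking about 1e-13 or 1e-14), the much faster approach of computing the pseudo-inverse via (A'*A)^-1 * A' can be used.

If your case is actually A*x=0, simply use an SVD and read the last column vector from V; that is the solution.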
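As a small illustration of the homogeneous case, the sketch below builds a rank-deficient matrix with a known null vector and recovers it from the last column of V. The matrix and the vector x0 are hypothetical test data, not taken from the question.

```matlab
% Sketch: solve A*x = 0 by reading the last right singular vector.
rng(0);                          % fixed seed so the run is repeatable
x0 = [1; -2; 3];                 % hypothetical known solution direction
N  = null(x0');                  % 3-by-2 orthonormal basis orthogonal to x0
B  = rand(50, 2);
A  = B * N';                     % 50-by-3 matrix of rank 2 with A*x0 = 0

[~, ~, V] = svd(A);              % only V is needed
x = V(:, end);                   % unit vector for the smallest singular value

residual  = norm(A * x);                  % close to zero
alignment = abs(x' * x0) / norm(x0);      % 1 if x is parallel to x0

fprintf('norm(A*x) = %g, alignment with x0 = %g\n', residual, alignment);
```

The recovered x is only defined up to scale and sign. With noisy data the residual is no longer zero, and the last column of V is then the unit vector that minimizes norm(A*x).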

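If you are using the SVD not to solve a linear system but for the resulting U and S (as your example suggests), I am not sure that what I have posted will help you.

Sources: 1, 2, 3

Here is some sample code for you to test. Test it with large matrices, and you will see that using (A'*A)^-1 * A' is much faster than the alternatives.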

clear all

nbRows = 30000;
nbCols = 100;
% Matrix A
A = rand(nbRows,nbCols);

% Vector b
b = rand(nbRows,1);

% A*x=b

% Solve for x, using SVD
% [U,S,V]=svd(A,0);
% x= V*((U'*b)./diag(S))
tic
[U1,S1,V1]=svd(A,0);
x1= V1*((U1'*b)./diag(S1));
toc

tic
[U1,S1,V1]=svd(A,0);
x2 = V1*inv(S1)*U1'*b;
toc

% Solve for x, using manual pseudo-inverse
% A*x=b
% A'*A*x = A'*b
% x = (A'*A)^-1 * A'*b
tic
x3 = inv(A'*A) * A'*b;
toc

% Solve for x, let Matlab decide how (for a rectangular A, backslash uses a QR-based least-squares solver)
tic
x4 = A\b;
toc
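Speed is only half of the comparison, so it can be worth checking that the faster variants still agree with the backslash solution. The following is a sketch on a smaller, well-conditioned random system. It also swaps the explicit inv(A'*A) for a backslash solve of the normal equations, which is a substitution on my part and not part of the code above.

```matlab
% Sketch: compare the accuracy of three least-squares solutions.
rng(0);                          % fixed seed so the run is repeatable
A = rand(3000, 100);             % illustrative size
b = rand(3000, 1);

xQR = A \ b;                     % reference: backslash least-squares solution

[U, S, V] = svd(A, 'econ');      % economy-size SVD
xSVD = V * ((U' * b) ./ diag(S));

xNE = (A' * A) \ (A' * b);       % normal equations, without an explicit inv

relSVD = norm(xSVD - xQR) / norm(xQR);
relNE  = norm(xNE  - xQR) / norm(xQR);

fprintf('cond(A) = %g\n', cond(A));
fprintf('relative difference, SVD vs backslash: %g\n', relSVD);
fprintf('relative difference, normal equations vs backslash: %g\n', relNE);
```

For a well-conditioned A like this one, all three results should agree to many digits. The gap for the normal equations is expected to widen as cond(A) grows.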

score 1 · Accepted Answer

The problem

First of all, I replicated your problem in Matlab2016b using the following code:

clear all
close all
clc

Nrows = 2500;
Ncols = 2500;

NumTests = 10;

h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);

timingCPU = 0;
timingGPU = 0;

for k = 1 : NumTests
    % --- Host
    tic
    [h_U, h_S, h_V] = svd(h_A);
%     h_S = svd(h_A);
    timingCPU = timingCPU + toc;

    % --- Device
    tic
    [d_U, d_S, d_V] = svd(d_A);
%     d_S = svd(d_A);
    timingGPU = timingGPU + toc;
end

fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);

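With the above code, it is possible either to compute only the singular values or to compute the full SVD including the singular vectors. It is also possible to compare the different behavior of the CPU and GPU versions of the SVD code.

The timings are reported in the following table (timings in s; Intel Core i7-6700K CPU @ 4.00GHz, 16288 MB, Max threads(8), GTX 960):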

              Matlab                               | cuSOLVER (cusolverDnSgesvd)
              Sing. values only | Full SVD         | Sing. val. only | Full SVD
                                |                  |                 |
Matrix size   CPU      GPU      | CPU       GPU    | GPU             | GPU
                                |                  |                 |
 200 x  200   0.0021    0.043   |  0.0051    0.024 |   0.098         |  0.15
1000 x 1000   0.0915    0.3     |  0.169     0.458 |   0.5           |  2.3
2500 x 2500   3.35      2.13    |  4.62      3.97  |   2.9           |  23
5000 x 5000   5.2      13.1     | 26.6      73.8   |  16.1           | 161

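The first 4 columns compare the CPU and GPU Matlab versions of the svd routine, used either to compute only the singular values or to compute the full SVD. As can be seen, the GPU version can be significantly slower than the CPU version. The motivation has already been pointed out in some of the answers above: there is an inherent difficulty in parallelizing the SVD computation.

Using cuSOLVER?

At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mex files to make the cuSOLVER routines run under Matlab. Unfortunately, the situation with cuSOLVER is even worse, as can be deduced from the last two columns of the table above. Those columns report the timings of the codes in "Singular values calculation only with CUDA" and "Parallel implementation for multiple SVDs using CUDA", which use cusolverDnSgesvd for the singular-values-only computation and the full SVD computation, respectively. As can be seen, cuSOLVER's cusolverDnSgesvd performs even worse than Matlab, considering that it works in single precision while Matlab works in double precision.

The motivation for this behavior is further explained in "cusolverDnCgesvd performance vs MKL", where Joe Eaton, the manager of the cuSOLVER library, says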

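I understand the confusion here. We do provide a decent speedup of LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0. Since CUDA runs on more than x86 host CPUs these days, cuSOLVER fills a need where there is no MKL. That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.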

Using other libraries

At this point, other possibilities are using other libraries, such as

CULA;
MAGMA;
ArrayFire.

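CULA is not offered for free, so I have not tried it.

I had some installation issues with MAGMA dependencies, so I have not investigated it further (disclaimer: I expect that, with some more time, I would be able to solve those issues).

I then finally ended up using ArrayFire.

Using ArrayFire, I got the following timings for the full SVD computation: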

 200 x  200      0.036
1000 x 1000      0.2
2500 x 2500      4.5
5000 x 5000     29

As can be seen, the timings are slightly higher, but now comparable to the CPU case.

Here is the ArrayFire code:

#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>

using namespace af;

int main(int argc, char *argv[])
{
    const int N = 1000;

    try {

        // --- Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();

        array A = randu(N, N, f64);
        af::array U, S, Vt;

        // --- Warming up
        timer time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        double elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

        time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

    }
    catch (af::exception& e) {

        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}

score 0 · Accepted Answer

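I have been trying to parallelize SVD on my laptop with a GTX 460 for more than a month, as part of my undergraduate thesis. I ran many experiments and eventually found that MATLAB's svd is extremely fast and outperforms my code. By the way, I used one-sided Jacobi, and I have not yet seen a paper that presents an algorithm faster than MATLAB's svd. On the GPU, the cost of memory copies can be very high if you do not use a well-designed model, so I suggest you read more about CUDA. If you need any help, please contact me.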
