c++ - How good is OpenCV GPU library for matrix operations?

Question

I'm using OpenCV for an application in computer vision. I'd like to accelerate some matrix operations (matrices are fairly large) on GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g. Thrust) instead?

EDIT Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.

Matlab implementation:

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))

P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);

[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);

pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);

% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';

end
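
The vectorized expression for K follows from expanding the squared norm. As a short derivation (the notation is introduced here and is not part of the original post: p_i and q_j denote row i of P and row j of Q written as column vectors, and 1_n is a column of n ones):

```latex
\begin{aligned}
\mathrm{pmag}_i &= p_i^{\top} p_i, \qquad \mathrm{qmag}_j = q_j^{\top} q_j, \\
K_{ij} &= (p_i - q_j)^{\top}(p_i - q_j)
        = \mathrm{qmag}_j + \mathrm{pmag}_i - 2\,p_i^{\top} q_j, \\
K &= \mathbf{1}_{n_P}\,\mathrm{qmag}^{\top}
   + \mathrm{pmag}\,\mathbf{1}_{n_Q}^{\top}
   - 2\,P Q^{\top}.
\end{aligned}
```

The last line corresponds term by term to `ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu'`, so the whole computation reduces to two row-wise reductions and one matrix product.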

UPDATE Here's another Matlab implementation that accomplishes the same thing (thanks to https://stackoverflow.com/a/7774323/1121420). However, it runs only on the CPU because bsxfun is not supported by PCT. I'm still looking for a C++ alternative, though.

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
% Runs on CPU only.

K = bsxfun(@plus,sum(P_cpu.^2,2),sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');

end
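
When porting this to C++, a plain CPU reference can be useful for checking a GPU result against known-good numbers. The following is a minimal sketch, not code from the original post: the function name `sqEuclideanDistRef` and the row-major `std::vector<double>` layout are assumptions made for illustration. It uses the same expansion as the Matlab code (row norms plus a dot product).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise squared Euclidean distances between the rows of P (nP x d)
// and the rows of Q (nQ x d). Inputs and output are row-major, so
// K[i * nQ + j] = ||P(i,:) - Q(j,:)||^2.
// Hypothetical helper for illustration; not part of OpenCV or ArrayFire.
std::vector<double> sqEuclideanDistRef(const std::vector<double>& P,
                                       const std::vector<double>& Q,
                                       std::size_t nP, std::size_t nQ,
                                       std::size_t d)
{
    assert(P.size() == nP * d);
    assert(Q.size() == nQ * d);

    // Squared row norms, the counterparts of pmag and qmag.
    std::vector<double> pmag(nP, 0.0), qmag(nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i)
        for (std::size_t k = 0; k < d; ++k)
            pmag[i] += P[i * d + k] * P[i * d + k];
    for (std::size_t j = 0; j < nQ; ++j)
        for (std::size_t k = 0; k < d; ++k)
            qmag[j] += Q[j * d + k] * Q[j * d + k];

    // K(i,j) = pmag(i) + qmag(j) - 2 * dot(P(i,:), Q(j,:))
    std::vector<double> K(nP * nQ);
    for (std::size_t i = 0; i < nP; ++i) {
        for (std::size_t j = 0; j < nQ; ++j) {
            double dot = 0.0;
            for (std::size_t k = 0; k < d; ++k)
                dot += P[i * d + k] * Q[j * d + k];
            K[i * nQ + j] = pmag[i] + qmag[j] - 2.0 * dot;
        }
    }
    return K;
}
```

For example, calling it with the rows (0,0) and (3,4) for P and the rows (0,0) and (1,0) for Q gives a 2 x 2 result that can be compared element by element with the output of a GPU implementation, within a floating-point tolerance.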

score 5 · Accepted Answer

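I find ArrayFire to be much faster, and I have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (which used to be offered through a different interface called LibJacket) with OpenCV. The same has held in my own benchmarking: ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone else, which may be why they are so slow. Since I'm only using 1 GPU, I can use ArrayFire for free.

UPDATE, given the new MATLAB code posted by @Alex: I ran a benchmark of this code on my system. I found that Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire are much faster. The hardware specs are: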

Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
NVIDIA Tesla M2090

Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). The CPU is faster than PCT gpuArray:

>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.

Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.

Here is the modified code that lets you run all of the above easily:

function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))

[nP, d] = size(P);
[nQ, d] = size(Q);

pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);

K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';

end

Jacket does support BSXFUN on the GPU, and it does improve the speed somewhat:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.

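Note that the sizes used here are quite small, so most CUDA code that attempts to run on inputs this small is likely to perform poorly. That's why I like to use AccelerEyes' software: they have optimized heavily for the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I have tried in the past.

Here are the ArrayFire Free C++ results: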

Time:  0.0003577 seconds
Speedups:  19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster
than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using
BSXFUN
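
The quoted speedups are simply ratios of the elapsed times reported earlier in this answer. As a small sketch for rechecking them (the helper names `speedup` and `printReportedSpeedups` are made up here; the timings are copied from the measurements above):

```cpp
#include <cstdio>

// How many times faster the candidate is than the baseline, given two
// elapsed times in the same unit.
double speedup(double baselineSeconds, double candidateSeconds)
{
    return baselineSeconds / candidateSeconds;
}

// Prints the ratios for the timings quoted in this answer.
void printReportedSpeedups()
{
    const double pct       = 0.006859;   // PCT gpuArray
    const double cpu       = 0.005712;   // MATLAB on the CPU
    const double jacket    = 0.001876;   // Jacket, original version
    const double jacketBsx = 0.001420;   // Jacket, BSXFUN version
    const double arrayfire = 0.0003577;  // ArrayFire Free C++

    std::printf("vs PCT gpuArray:    %.1fX\n", speedup(pct, arrayfire));
    std::printf("vs CPU:             %.1fX\n", speedup(cpu, arrayfire));
    std::printf("vs Jacket:          %.1fX\n", speedup(jacket, arrayfire));
    std::printf("vs Jacket (BSXFUN): %.1fX\n", speedup(jacketBsx, arrayfire));
}
```

Keep in mind that the MATLAB figures come from single tic/toc calls, so the ratios are indicative rather than precise.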

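Here is the ArrayFire code I wrote for this:

#include <arrayfire.h>
#include <cstdio>

using namespace af;

static array SqEuclideanDist(array P, array Q)
{
    // Dimensions are 0-based: summing along dimension 1 gives one value per row
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);

    int np = P.dims(0);
    int nq = Q.dims(0);

    // Same expression as the MATLAB version; tile() plays the role of ones(...)
    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
    // Value-initialized to zero; the timing does not depend on the contents
    double *P_cpu = new double[1581 * 3]();
    double *Q_cpu = new double[189 * 3]();

    // Copy the host buffers to the device
    array P = array(1581, 3, P_cpu);
    array Q = array(189 , 3, Q_cpu);
    af::sync();

    int iter = 1000;

    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);  // force evaluation of the lazily built expression
    }

    // Wait for all queued GPU work to finish before stopping the timer
    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

    delete[] P_cpu;
    delete[] Q_cpu;
}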
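
The timing loop above follows a common pattern: warm up, run many iterations, make sure the device has finished, then divide by the iteration count. A library-independent sketch of that pattern, using only the C++ standard library, is shown below. The name `meanSecondsPerCall` is hypothetical, and the callable is assumed to block until its work is complete (for a GPU library, that means synchronizing inside it).

```cpp
#include <chrono>

// Runs `work` `warmup` times untimed, then `iters` times timed, and returns
// the mean wall-clock seconds per timed call. `iters` must be positive.
// For asynchronous GPU libraries, `work` itself has to wait for the device
// to finish, otherwise only the launch overhead is measured.
template <typename F>
double meanSecondsPerCall(F&& work, int warmup, int iters)
{
    for (int i = 0; i < warmup; ++i)
        work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

Averaging matters most for inputs as small as the ones used here, where a single run takes well under a millisecond and one-off overheads can dominate a single measurement.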

score 1 · Accepted Answer

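They were contributed by NVIDIA, so they perform well on CUDA-compatible cards. The actual performance depends on the card itself and on the functions you use.

In my experience, only cvRotate and cvResize perform better than on an ordinary Intel CPU. (Note: I was only interested in image-related functions.)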
