
I am calling a CUDA kernel from my MATLAB implementation, but my CPU result is faster than my GPU implementation. I know that larger matrices should achieve better performance, but even when I try large sizes I still get poor GPU performance.

The results are: CPU: 0.000006, GPU: 0.00134. My kernel and MATLAB code are below:

Thanks in advance!

matrix.cu

__global__ void matrix_mult2(double *A, double *B, double *C) {
    // One thread per element: this is an element-wise product, not a matrix multiply.
    int x = threadIdx.x;
    C[x] = A[x] * B[x];
}



main.m
kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...
                              'matrix_mult2.cu' );


kernel.ThreadBlockSize = [25,1,1];
kernel.GridSize = [1,1];


A = parallel.gpu.GPUArray.rand(5,5,'double');
B = parallel.gpu.GPUArray.rand(5,5,'double');
C = parallel.gpu.GPUArray.zeros(5,5);

C = feval(kernel,A,B,C); 

1 Answer


You need to give the GPU some real work to do. In your current example, the only time-consuming operations are copying the data to the GPU and back. Since the CPU doesn't have to perform these steps, it has an obvious advantage here. Try, e.g., a real matrix multiplication of large matrices (not an element-wise multiplication).

In slightly more formal terms, your kernel is PCIe-bandwidth bound. To amortize the time spent copying N elements back and forth, you need to do operations that are substantially more expensive than the data copying itself. Element-wise multiplication is cheap and scales linearly with N. Multiplication of N×N matrices scales as N³, while the data transfer only scales as N², so for large enough matrices, matrix multiplication on the GPU will be faster than on the CPU.
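As a rough sketch of the kind of compute-heavy kernel meant here (a naive full matrix multiply; the kernel name, the `N` parameter, and the row-major indexing are illustrative assumptions, not the asker's code — MATLAB arrays are actually column-major, so for correct results from MATLAB you would swap the index arithmetic accordingly):

```cuda
// Naive N x N matrix multiplication: each thread computes one element of C.
// Each element does O(N) multiply-adds, so total work is O(N^3) against
// O(N^2) data transferred -- for large N, compute dominates the PCIe copy.
__global__ void matrix_mult_full(const double *A, const double *B,
                                 double *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        double sum = 0.0;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = sum;
    }
}
```

Launched with a 2D grid covering the whole matrix (e.g. 16×16 thread blocks), this gives the GPU O(N³) arithmetic to hide the O(N²) transfer cost; a production version would use shared-memory tiling or simply call cuBLAS.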

Answered 2012-11-09T13:16:01.087