
I am calling a CUDA kernel from my MATLAB implementation, but my CPU result is faster than my GPU implementation. I know that larger matrices should achieve better performance, but even when I try large sizes I still get poor GPU performance.

The results are: CPU: 0.000006, GPU: 0.00134. My kernel and MATLAB code are below:

Thanks in advance!

matrix.cu

__global__ void matrix_mult2(double *A, double *B, double *C) {
    // One thread per element: this is an element-wise product, not a matrix multiply.
    int x = threadIdx.x;
    C[x] = A[x] * B[x];
}



main.m
kernel = parallel.gpu.CUDAKernel( 'matrix_mult2.ptx', ...
                              'matrix_mult2.cu' );


kernel.ThreadBlockSize = [25,1,1];
kernel.GridSize = [1,1];


A = parallel.gpu.GPUArray.rand(5,5,'double');
B = parallel.gpu.GPUArray.rand(5,5,'double');
C = parallel.gpu.GPUArray.zeros(5,5);

C = feval(kernel,A,B,C); 

1 Answer


You need to give the GPU some real work to do. In your current example, the only time-consuming operations are copying the data to the GPU and back. Since the CPU doesn't have to perform these steps, it has an obvious advantage here. Try, e.g., a real matrix multiplication of large matrices (not an element-wise multiplication).

In slightly more formal terms, your kernel is PCIe-bandwidth bound. To amortize the time spent copying N elements back and forth, you need to do operations that are substantially more expensive than the data copying itself. Element-wise multiplication is cheap and scales linearly with N. Multiplication of N×N matrices scales as N³, while the data transfer only scales as N², so for large enough matrices, matrix multiplication on the GPU will be faster than on the CPU.
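As a rough sketch of the kind of compute-heavy kernel meant here (a naive full matrix multiply; the kernel name, the `N` parameter, and the row-major indexing are illustrative assumptions, not the asker's code — MATLAB arrays are actually column-major, so for correct results from MATLAB you would swap the index arithmetic accordingly):

```cuda
// Naive N x N matrix multiplication: each thread computes one element of C.
// Each element does O(N) multiply-adds, so total work is O(N^3) against
// O(N^2) data transferred -- for large N, compute dominates the PCIe copy.
__global__ void matrix_mult_full(const double *A, const double *B,
                                 double *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        double sum = 0.0;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = sum;
    }
}
```

Launched with a 2D grid covering the whole matrix (e.g. 16×16 thread blocks), this gives the GPU O(N³) arithmetic to hide the O(N²) transfer cost; a production version would use shared-memory tiling or simply call cuBLAS.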

Answered 2012-11-09T13:16:01.087