
I'm running into a problem with a program I'm writing in CUDA. The program implements an encryption scheme: it multiplies a matrix by a vector and gives me a result that depends on the vector I feed in. The problem is that I timed it in both C++ and CUDA, and C++ gives me better results than CUDA. What I do is run a loop, because I need several keys for the encryption; the code is as follows:

t1 = clock();
do {

    // Copy the matrix and the current vector to the device on every iteration
    HANDLE_ERROR ( cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice) );
    HANDLE_ERROR ( cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice) );

    // One block of b threads computes MAT * VEC into SOL
    mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);

    // Copy the result back to the host
    HANDLE_ERROR ( cudaMemcpy(SOL, SOL_dev, nBytes, cudaMemcpyDeviceToHost) );

    // Print the result vector
    for (i = 0; i < b; i++) {
        cout << SOL[i] << " ";
    }
    cout << endl;

    // Feed the result back in as the next input vector
    for (i = 0; i < b; i++) {
        VEC[i] = SOL[i];
    }

    cont = cont + 1;

} while (cont < w);
t2 = clock();

My results:

C++: 11.474 minutes

CUDA: 40.464 minutes

The number of keys is 1,000,000. The matrix is 7 x 7 and the vector has 7 elements.

I don't know whether this is to be expected, or whether I'm missing something that would make it faster.

Thanks for your help.


1 Answer


Possible problems with your code:

  1. Most of the time is probably spent in cudaMemcpy() and cout <<, not in the kernel.
  2. Speed may be limited by the grid/block size. Generally speaking, the number of blocks in a grid should be >= the number of streaming multiprocessors (SMs) to fully utilize the GPU hardware, and the number of threads per block should be at least 64 and always a multiple of the warp size (see the launch sketch after this list).
  3. The matrix/vector size is too small to achieve good scalability; a single 7x7-by-7 product cannot keep the GPU busy.
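
To illustrate points 1–3, here is a minimal sketch of a batched launch (mult_many, mats_dev, sols_dev, nKeys and the float element type are assumptions, not names from the question): each thread computes one full 7-element product, the block size is a multiple of the warp size, and there are enough blocks to cover all keys. This only applies if the 1,000,000 products are independent of each other, which is what suggestion 1 below relies on.

// Hypothetical batched kernel: thread k computes the product of the k-th
// 7x7 matrix with the shared 7-element vector.
__global__ void mult_many(const float *mats, const float *vec, float *sols, int nKeys)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nKeys) return;

    const float *m = mats + k * 7 * 7;   // 7x7 matrix of key k (row-major)
    float *s = sols + k * 7;             // 7-element result for key k
    for (int row = 0; row < 7; row++) {
        float acc = 0.0f;
        for (int col = 0; col < 7; col++)
            acc += m[row * 7 + col] * vec[col];
        s[row] = acc;
    }
}

// Threads per block: a multiple of the warp size (32), at least 64.
// Blocks: enough to cover all keys, so every SM has work to do.
int threads = 128;
int blocks  = (nKeys + threads - 1) / threads;
mult_many<<< blocks, threads >>>(mats_dev, vec_dev, sols_dev, nKeys);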

Possible solutions:

  1. Instead of doing 1,000,000 separate m_{7x7} * v_{7} products, try to do a single m_{7,000,000x7} * v_{7} product;
  2. try to merge the 1,000,000 cudaMemcpy() calls into 1 (a sketch follows this list);
  3. use cudaMallocPitch() to allocate memory for the small matrices, which relaxes the alignment problem;
  4. try cublas<t>gemv() (e.g. cublasSgemv() / cublasDgemv()) from the cuBLAS library if the element type of your matrix/vector is float/double.
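
As a sketch of suggestion 2 applied directly to the loop from the question (reusing MAT_dev, VEC_dev, SOL_dev, nBytes, b and w from the question, and assuming the matrix does not change between keys and the elements are double), the matrix and the starting vector are copied once, intermediate results stay on the device, and only one transfer comes back at the end:

// Copy the matrix and the starting vector once, outside the timed loop.
HANDLE_ERROR( cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice) );
HANDLE_ERROR( cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice) );

t1 = clock();
for (cont = 0; cont < w; cont++) {
    // Compute SOL_dev = MAT_dev * VEC_dev on the device ...
    mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);

    // ... and feed it back by swapping pointers instead of copying to the host.
    double *tmp = VEC_dev;
    VEC_dev = SOL_dev;
    SOL_dev = tmp;
}
HANDLE_ERROR( cudaDeviceSynchronize() );   // kernel launches are asynchronous
t2 = clock();

// Single device-to-host transfer: VEC_dev now holds the last result.
HANDLE_ERROR( cudaMemcpy(SOL, VEC_dev, nBytes, cudaMemcpyDeviceToHost) );

Note that the cout << printing has been moved out of the loop; if you really need every intermediate vector on the host, write each result to a different offset of one large device buffer and copy the whole buffer back with a single cudaMemcpy() after the loop.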

You may wish to read the CUDA C Programming Guide and the CUDA C Best Practices Guide before writing your own kernels.

Answered 2013-01-08T23:43:28.607