
I'm running into a problem with a program I'm writing in CUDA. The program implements an encryption scheme: it multiplies a matrix by a vector and gives me a result that depends on the vector I feed in. The problem is that I timed it in both C++ and CUDA, and C++ gives me better results than CUDA. What I do is run a loop, because I need several keys for the encryption; the code is as follows:

t1 = clock();
do {

    // Copy the matrix and the current vector to the device on every iteration
    HANDLE_ERROR ( cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice) );
    HANDLE_ERROR ( cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice) );

    // One block of b threads computes MAT * VEC into SOL
    mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);

    // Copy the result back to the host
    HANDLE_ERROR ( cudaMemcpy(SOL, SOL_dev, nBytes, cudaMemcpyDeviceToHost) );

    // Print the result vector
    for (i = 0; i < b; i++) {
        cout << SOL[i] << " ";
    }
    cout << endl;

    // Feed the result back in as the next input vector
    for (i = 0; i < b; i++) {
        VEC[i] = SOL[i];
    }

    cont = cont + 1;

} while (cont < w);
t2 = clock();

My results:

C++: 11.474 minutes

CUDA: 40.464 minutes

The number of keys is 1,000,000. The matrix is 7 x 7 and the vector has 7 elements.

I don't know whether this is to be expected, or whether I'm missing something that would make it faster.

Thanks for your help.


1 Answer


Possible problems with your code:

  1. Most of the time is probably spent in cudaMemcpy() and cout <<, not in the kernel.
  2. Speed may be limited by the grid/block size. Generally speaking, the number of blocks in a grid should be >= the number of streaming multiprocessors (SMs) to fully utilize the GPU hardware, and the number of threads per block should be at least 64 and always a multiple of the warp size (see the launch sketch after this list).
  3. The matrix/vector size is too small to achieve good scalability; a single 7x7-by-7 product cannot keep the GPU busy.
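
To illustrate points 1–3, here is a minimal sketch of a batched launch (mult_many, mats_dev, sols_dev, nKeys and the float element type are assumptions, not names from the question): each thread computes one full 7-element product, the block size is a multiple of the warp size, and there are enough blocks to cover all keys. This only applies if the 1,000,000 products are independent of each other, which is what suggestion 1 below relies on.

// Hypothetical batched kernel: thread k computes the product of the k-th
// 7x7 matrix with the shared 7-element vector.
__global__ void mult_many(const float *mats, const float *vec, float *sols, int nKeys)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nKeys) return;

    const float *m = mats + k * 7 * 7;   // 7x7 matrix of key k (row-major)
    float *s = sols + k * 7;             // 7-element result for key k
    for (int row = 0; row < 7; row++) {
        float acc = 0.0f;
        for (int col = 0; col < 7; col++)
            acc += m[row * 7 + col] * vec[col];
        s[row] = acc;
    }
}

// Threads per block: a multiple of the warp size (32), at least 64.
// Blocks: enough to cover all keys, so every SM has work to do.
int threads = 128;
int blocks  = (nKeys + threads - 1) / threads;
mult_many<<< blocks, threads >>>(mats_dev, vec_dev, sols_dev, nKeys);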

Possible solutions:

  1. Instead of doing 1,000,000 separate m_{7x7} * v_{7} products, try to do a single m_{7,000,000x7} * v_{7} product;
  2. try to merge the 1,000,000 cudaMemcpy() calls into 1 (a sketch follows this list);
  3. use cudaMallocPitch() to allocate memory for the small matrices, which relaxes the alignment problem;
  4. try cublas<t>gemv() (e.g. cublasSgemv() / cublasDgemv()) from the cuBLAS library if the element type of your matrix/vector is float/double.
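
As a sketch of suggestion 2 applied directly to the loop from the question (reusing MAT_dev, VEC_dev, SOL_dev, nBytes, b and w from the question, and assuming the matrix does not change between keys and the elements are double), the matrix and the starting vector are copied once, intermediate results stay on the device, and only one transfer comes back at the end:

// Copy the matrix and the starting vector once, outside the timed loop.
HANDLE_ERROR( cudaMemcpy(MAT_dev, MAT, nBytes, cudaMemcpyHostToDevice) );
HANDLE_ERROR( cudaMemcpy(VEC_dev, VEC, nBytes, cudaMemcpyHostToDevice) );

t1 = clock();
for (cont = 0; cont < w; cont++) {
    // Compute SOL_dev = MAT_dev * VEC_dev on the device ...
    mult<<< 1, b >>>(MAT_dev, VEC_dev, SOL_dev, b);

    // ... and feed it back by swapping pointers instead of copying to the host.
    double *tmp = VEC_dev;
    VEC_dev = SOL_dev;
    SOL_dev = tmp;
}
HANDLE_ERROR( cudaDeviceSynchronize() );   // kernel launches are asynchronous
t2 = clock();

// Single device-to-host transfer: VEC_dev now holds the last result.
HANDLE_ERROR( cudaMemcpy(SOL, VEC_dev, nBytes, cudaMemcpyDeviceToHost) );

Note that the cout << printing has been moved out of the loop; if you really need every intermediate vector on the host, write each result to a different offset of one large device buffer and copy the whole buffer back with a single cudaMemcpy() after the loop.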

You may wish to read the CUDA C Programming Guide and the CUDA C Best Practices Guide before writing your own kernels.

Answered 2013-01-08T23:43:28.607