cuda - Problem with CUDA streams

Question

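I am running CUBLAS v2.0 on different streams on a single GPU (Tesla C2050) by subdividing the input matrices (A[x/num_of_streams*y] B[x y] = C[x/num_of_streams*y]), but for some reason the run takes longer when I use CUDA streams. Here is the code snippet: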

            //plan is a struct containing the matrix dimensions and stream numbers
            //parallel in nstreams - should be! MAX 16 streams could run concurrently
            //Copy A - cudaMemcpyAsync
            for(i = 0; i < nstreams; i++)
                    cudgemm_copyA_in_streams (&plan[i]);
            //Copy B - cudaMemcpyAsync
            for(i = 0; i < nstreams; i++)
                    cudgemm_copyB_in_streams (&plan[i]);

            //Create handles - serial
            for(i = 0; i < nstreams; i++)
                    handle[i] = create_handle();

            //Run kernels - first doing a cublasSetStream(handle, plan->stream) before running cublasDgemm... 
            for(i = 0; i < nstreams; i++)
                    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);

            //Destroy handles - serial
            for(i = 0; i < nstreams; i++)
                    destroy_handle (handle[i]);

            //Copy C - cudaMemcpyAsync
            for(i = 0; i < nstreams; i++)
                    cudgemm_copyC_in_streams (&plan[i]);

            //EDIT: Function body

            //The other two copy functions are exactly the same as this
            void cudgemm_copyA_in_streams(TGPUplan *plan)
            {
                cudasafe(cudaMemcpyAsync(plan->Ad_Data, plan->Ah_Data, (plan->Acols * plan->Arows * sizeof(double)), cudaMemcpyHostToDevice, plan->stream) );
            }

            //Create handle
            cublasHandle_t create_handle ()
            {
                cublasHandle_t handle;
                checkError(cublasCreate(&handle), "cublasCreate() error!\n");
                return handle;
            }

            //Destroy handle
            void destroy_handle (cublasHandle_t handle)
            {
                checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
            }

            //Kernel
            void cudgemm_kernel_in_streams(TGPUplan *plan, cublasHandle_t handle, const double alpha, const double beta)
            {
                cublasStatus_t ret;
                cublasSetStream(handle, plan->stream);

                ret = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, plan->Arows, plan->Ccols, plan->Acols, &alpha, plan->Ad_Data, plan->Arows, plan->Bd_Data, plan->Brows, &beta, plan->Cd_Data, plan->Crows);
                checkError(ret, "cublas Dgemm returned an error!\n");
            }

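So I am bouncing back and forth between the streams and assigning work, expecting a better execution time, but I notice that the more streams I use, the longer the program takes compared with the version that does not use streams. Where am I going wrong? Cross-posted to the Nvidia forums - http://forums.nvidia.com/index.php?showtopic=209420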

Edit:

I modified my program as follows:

            //copy data
            for(i = 0; i < nstreams; i++)
            {
                    cudgemm_copyA_in_streams (&plan[i]);
                    cudgemm_copyB_in_streams (&plan[i]);
            }

            //Run kernel and copy back
            for(i = 0; i < nstreams; i++)
            {
                    cudgemm_kernel_in_streams (&plan[i], handle[i], 1.0f, 1.0f);
                    cudgemm_copyC_in_streams (&plan[i]);
            }

When I profile my program for a matrix order of 6144, I get the following:

Kernel time = 42.75 % of total GPU time 
Memory copy time = 28.9 % of total GPU time
Kernel taking maximum time = fermiDgemm_v2_kernel_val (42.8% of total GPU time)
Memory copy taking maximum time = memcpyHtoDasync (21.7% of total GPU time)
Total overlap time in GPU = 65268.3 micro sec. (3.6% of total GPU time)

(Profiler timeline screenshot: blue = kernel, green = cudaMemcpyAsync in 2 streams)

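When I time the loops above, I get 0.000284 seconds, versus 1.703289 seconds for the version that does not use streams (in that version I also time the two sequential memory copies, the kernel invocation, and the remaining memcpy). Since I am not using any synchronization constructs, I suspect I am printing the time before the computation has actually finished (I find it hard to believe there is a 100% improvement).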

score 2 · Accepted Answer

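I suggest two changes:

1) Move your cuBLAS handle creation/destruction outside of the copies and kernel invocations. It may be breaking concurrency by performing an unneeded context synchronization.

2) Do the memcpys together in one loop over the streams. That way, the B copy for stream 0 does not perform any extra synchronization to wait for the A memcpys to complete. That is, do this: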

        for(i = 0; i < nstreams; i++) {
                cudgemm_copyA_in_streams (&plan[i]);
                cudgemm_copyB_in_streams (&plan[i]);
        }

Not this:

        for(i = 0; i < nstreams; i++)
                cudgemm_copyA_in_streams (&plan[i]);
        for(i = 0; i < nstreams; i++)
                cudgemm_copyB_in_streams (&plan[i]);

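Don't be surprised if you can't get a speedup of more than 40% or so from overlapping transfers and computation. Streams deliver the biggest benefit on workloads that spend equal time transferring and processing data, and very few workloads fall into that category.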

score 1 · Accepted Answer

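I would also suggest checking the size of the copies: you should only start using separate streams when the time to transfer a block of memory is comparable to the time needed to compute on it. If the transfer time is small compared with the compute time, adding streams only adds management overhead. Use the Visual Profiler to see how long the individual steps take, and plot the results for different input sizes.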
