cuda - Timing a CUDA kernel that is executed more than once

Question

I want to time a kernel that has to run more than once; each launch processes different data. My code is below. The time spent in cudaMemcpy should not be counted.

cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    cudaMemcpy(...);
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA1<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal+=msec;
}
cout<<"Total time = "<<msecTotal<<endl;

To be fair, the algorithm I am comparing against should be timed as follows:

cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA2<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal+=msec;
}
cout<<"Total time = "<<msecTotal<<endl;

My question is: is this approach correct? I am not sure. Clearly, the measured time should be longer than normal.

score 1 · Accepted Answer

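Either way, you should get similar results. By recording events around the kernel launch, you are definitely measuring only the time spent in the kernel, not any time spent on the memcpy.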

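My only nit is that by calling cudaEventSynchronize() in every iteration of the loop, you are breaking CPU/GPU concurrency, which is actually quite important for getting good performance. If you must time each kernel invocation separately (as opposed to putting the for loop of nIter iterations around the kernel invocation rather than around the whole operation), you may want to allocate more CUDA events. If you go that route, you do not need 2 events per loop iteration: you bracket the whole sequence with two events, and record only one CUDA event per loop iteration. The time for any given kernel invocation can then be computed by calling cudaEventElapsedTime() on adjacent recorded events.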
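As a rough illustration of the alternative mentioned in parentheses (one pair of events around the whole loop), the sketch below times nIter back-to-back launches and reports the average. The kernel scaleKernel, the function averageKernelMs, and the buffer size are hypothetical stand-ins, not the asker's code, and the approach only fits when per-iteration host work such as the cudaMemcpy can be moved out of the timed region.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed (not from the original post).
__global__ void scaleKernel(float *data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

// Returns the average GPU time per launch in milliseconds, or -1.0f on error.
float averageKernelMs(int n, int nIter)
{
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(dData, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);                     // one event before the whole loop
    for (int j = 0; j < nIter; j++)
        scaleKernel<<<blocks, threads>>>(dData, n, 1.0f);  // launches queue up, no sync here
    cudaEventRecord(stop, 0);                      // one event after the whole loop
    cudaError_t err = cudaEventSynchronize(stop);  // single wait at the very end

    float msTotal = 0.0f;
    if (err == cudaSuccess)
        err = cudaEventElapsedTime(&msTotal, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return (err == cudaSuccess) ? msTotal / nIter : -1.0f;
}
```

Calling averageKernelMs(1 << 20, 300) from main would return the mean time per launch. Because the host waits only once, it can queue all the launches without stalling, at the cost of losing the individual per-launch times.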

To record the GPU time between N events:

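cudaEvent_t events[N+1]; // N operations need N+1 bracketing events (N is nIter in the question)
for (int k = 0; k <= N; k++)
    cudaEventCreate( &events[k] ); // events must be created before they are recorded

cudaEventRecord( events[0], NULL ); // record first event
for (int j = 0; j < N; j++ ) {
    // invoke kernel, or do something else you want to time
    cudaEventRecord( events[j+1], NULL ); // marks the end of operation j
}
cudaEventSynchronize( events[N] ); // wait once, after all the work has been queued
// to compute the time taken for operation i, call:
float ms;
cudaEventElapsedTime( &ms, events[i], events[i+1] ); // start event first, then end event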
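The adjacent-event scheme can be sketched as a self-contained function. This is a minimal sketch under stated assumptions: addOneKernel and perLaunchMs are hypothetical names, the kernel is a trivial placeholder, and everything runs on the default stream. Note that anything issued between two adjacent events, including a cudaMemcpy, would be counted in that interval.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel (not from the original post).
__global__ void addOneKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Returns one GPU time in milliseconds per launch; an empty vector signals an error.
std::vector<float> perLaunchMs(int n, int nIter)
{
    std::vector<float> times;
    float *dData = nullptr;
    if (cudaMalloc(&dData, n * sizeof(float)) != cudaSuccess) return times;
    cudaMemset(dData, 0, n * sizeof(float));

    // nIter launches are bracketed by nIter + 1 events.
    std::vector<cudaEvent_t> events(nIter + 1);
    for (auto &e : events) cudaEventCreate(&e);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(events[0], 0);                 // start of launch 0
    for (int j = 0; j < nIter; j++) {
        addOneKernel<<<blocks, threads>>>(dData, n);
        cudaEventRecord(events[j + 1], 0);         // end of launch j, start of launch j + 1
    }

    // Wait once for the last event; earlier events on the same stream are then complete too.
    if (cudaEventSynchronize(events[nIter]) == cudaSuccess) {
        times.resize(nIter);
        for (int i = 0; i < nIter; i++)
            cudaEventElapsedTime(&times[i], events[i], events[i + 1]);
    }

    for (auto &e : events) cudaEventDestroy(e);
    cudaFree(dData);
    return times;
}
```

Summing the returned values gives a total comparable to msecTotal in the question, while the host never blocks inside the loop.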
