cuda - Recording elapsed time of CUDA kernels with cudaEventRecord() for a multi-GPU program

Question

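I have a sparse triangular solver that runs on 4 Tesla V100 GPUs. The implementation is complete and the results are accurate. However, I am measuring elapsed time with a CPU timer. I know a CPU timer is not the ideal way to time GPU work, and that I could use CUDA events instead.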

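The problem is that I do not know how to use CUDA events with multiple GPUs. The NVIDIA tutorials I have seen use events for inter-GPU synchronization, i.e. waiting for other GPUs to finish their dependencies. In any case, I declare the events like this: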

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];

I can also initialize these events in a loop, setting the current GPU on each iteration.

My kernel execution looks like this:

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     kernel<<<>>>();  // launch configuration and arguments omitted
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaDeviceSynchronize());
 }

My question is: how should I use these events to record the elapsed time of each GPU separately?

score 1 · Accepted Answer

You need to create two events per GPU and record them on each GPU, one before and one after the kernel call.

It could look something like this:

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];

for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
     CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     // In cudaEventRecord, omit the stream argument or set it to 0 to
     // record in the default stream. It must be the same stream as the
     // one in which the kernel is launched.
     CUDA_FUNC_CALL(cudaEventRecord(start_events[i], stream));
     kernel<<<>>>();  // launch configuration and arguments omitted
     CUDA_FUNC_CALL(cudaEventRecord(end_events[i], stream));
 }

 for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaDeviceSynchronize());
 }

 for(int i = 0; i < num_gpus; i++)
 {
     // The end event must have completed for the duration to be valid.
     // In this example that is guaranteed by the preceding device synchronization.
     float time_in_ms;
     CUDA_FUNC_CALL(cudaEventElapsedTime(&time_in_ms, start_events[i], end_events[i]));
     printf("Elapsed time on device %d: %f ms\n", i, time_in_ms);
 }

for(int i = 0; i < num_gpus; i++)
 {
     CUDA_FUNC_CALL(cudaSetDevice(i));
     CUDA_FUNC_CALL(cudaEventDestroy(start_events[i]));
     CUDA_FUNC_CALL(cudaEventDestroy(end_events[i]));
 }
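
For reference, the same pattern can be written as a self-contained sketch. Several things here are assumptions and not part of the original post: CUDA_FUNC_CALL is given a hypothetical abort-on-error definition (the post uses it without showing one), scale_kernel is a placeholder standing in for the real solver kernel, and each device gets its own explicit stream. The sketch also waits on each end event with cudaEventSynchronize rather than calling cudaDeviceSynchronize, which is one possible alternative since it blocks only until that event has completed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro (assumption: the original post does not
// show how CUDA_FUNC_CALL is defined). Aborts on any CUDA runtime error.
#define CUDA_FUNC_CALL(call)                                           \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Placeholder kernel (assumption): stands in for the real solver kernel.
__global__ void scale_kernel(float* data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= factor;
}

// Launches the placeholder kernel on every visible device and returns the
// elapsed kernel time in milliseconds, one entry per device.
std::vector<float> time_kernel_per_device(int n)
{
    int num_gpus = 0;
    CUDA_FUNC_CALL(cudaGetDeviceCount(&num_gpus));

    std::vector<cudaStream_t> streams(num_gpus);
    std::vector<cudaEvent_t> start_events(num_gpus), end_events(num_gpus);
    std::vector<float*> buffers(num_gpus, nullptr);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Events and streams belong to the device that is current at creation.
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaStreamCreate(&streams[i]));
        CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
        CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
        CUDA_FUNC_CALL(cudaMalloc(&buffers[i], bytes));
        CUDA_FUNC_CALL(cudaMemsetAsync(buffers[i], 0, bytes, streams[i]));
    }

    // Record start, launch, record end, all in the same per-device stream.
    // Nothing here blocks the host, so the devices can run concurrently.
    const int block = 256;
    const int grid = (n + block - 1) / block;
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventRecord(start_events[i], streams[i]));
        scale_kernel<<<grid, block, 0, streams[i]>>>(buffers[i], n, 2.0f);
        CUDA_FUNC_CALL(cudaGetLastError());  // catch launch errors
        CUDA_FUNC_CALL(cudaEventRecord(end_events[i], streams[i]));
    }

    // Wait for each end event, then read the time between the two events.
    std::vector<float> elapsed_ms(num_gpus, 0.0f);
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventSynchronize(end_events[i]));
        CUDA_FUNC_CALL(cudaEventElapsedTime(&elapsed_ms[i],
                                            start_events[i], end_events[i]));
    }

    // Release per-device resources.
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaFree(buffers[i]));
        CUDA_FUNC_CALL(cudaEventDestroy(start_events[i]));
        CUDA_FUNC_CALL(cudaEventDestroy(end_events[i]));
        CUDA_FUNC_CALL(cudaStreamDestroy(streams[i]));
    }
    return elapsed_ms;
}
```

Calling time_kernel_per_device(1 << 20) from main and printing each entry gives one timing per GPU. Each value covers only the work enqueued between the two events in that device's stream, so it does not include host-side launch overhead for the other devices.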
