cuda - Timing a CUDA kernel

Question

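Hi everyone, I am currently timing some of my CUDA code. I was able to time it using events. My kernel ran for 19 ms. I find this suspicious, because a sequential implementation of the same code takes around 5000 ms. I know the code should run faster, but should it be this much faster?

I am using wrapper functions to call the CUDA kernels from my cpp program. Should I be calling them there, or in the .cu file? Thanks!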

score 1 · Accepted Answer

The obvious way to check whether your program is working is to compare its output to that of your CPU-based implementation. If you get the same output, it is working by definition, right? :)

If your program is experimental in such a way that it doesn't really produce any verifiable output, then there is a good chance that the compiler has optimized out some (or all) of your code. The compiler removes code that does not contribute to output data. For instance, the entire body of a kernel can be removed if the final statement that stores the calculated value is commented out.
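A minimal sketch of that check, assuming a hypothetical element-wise kernel `scale_gpu` and a sequential counterpart `scale_cpu` (neither comes from the question): run both on the same input and compare the results within a small tolerance. Because the kernel's output is copied back and inspected, its store cannot be treated as dead code.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel, for illustration only: y[i] = a * x[i].
__global__ void scale_gpu(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];  // this store keeps the kernel body alive
}

// Sequential reference implementation of the same computation.
void scale_cpu(const float *x, float *y, float a, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i];
}

// Returns true if the GPU and CPU results agree within a relative tolerance.
bool gpu_matches_cpu(int n, float a, float tol = 1e-5f) {
    std::vector<float> x(n), y_cpu(n), y_gpu(n);
    for (int i = 0; i < n; ++i) x[i] = 0.5f * i;
    scale_cpu(x.data(), y_cpu.data(), a, n);

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_gpu<<<blocks, threads>>>(d_x, d_y, a, n);

    // This blocking copy on the default stream waits for the kernel to finish
    // and also surfaces errors from the launch.
    cudaError_t err = cudaMemcpy(y_gpu.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float diff = std::fabs(y_gpu[i] - y_cpu[i]);
        if (diff > tol * (1.0f + std::fabs(y_cpu[i]))) return false;
    }
    return true;
}
```

A tolerance is used rather than exact equality because, for less trivial kernels, the GPU may round differently from the CPU (for example through fused multiply-add).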

As to your speedup: 5000 ms / 19 ms ≈ 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.

score 0 · Accepted Answer

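Well, if you wrote your CUDA code correctly, then yes, it could be that much faster. Think about it. You moved your code from sequential execution on a single processor to parallel execution on hundreds of processors, depending on your GPU model. My $179 mid-range graphics card has 480 cores. Some cards available now have 1500 cores. A 100x performance jump is very possible with CUDA, especially if your kernel is more compute-bound than memory-bound.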

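That said, make sure you are measuring what you think you are measuring. Kernel launches are asynchronous with respect to the host thread, whether or not you use explicit streams, so a time measurement taken in the host thread will not reflect the kernel time unless you make the host thread wait for the kernel to complete. To do that, call cudaDeviceSynchronize() or have the host code wait on an event recorded after the kernel. You can also use CUDA events to measure the elapsed time on the GPU within a given stream. See Section 5.1.2 of the CUDA Best Practices Guide in the NVIDIA GPU Computing SDK 4.2.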
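Both approaches can be sketched roughly as follows. The kernel `busy_kernel` and the helper names are hypothetical stand-ins, not code from the question; the runtime calls themselves (`cudaEventRecord`, `cudaEventSynchronize`, `cudaEventElapsedTime`, `cudaDeviceSynchronize`) are the standard CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical kernel standing in for the real work.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Times one launch with CUDA events recorded in `stream`.
// Returns the elapsed GPU time in milliseconds, or -1 on error.
float time_with_events(float *d_data, int n, cudaStream_t stream) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, stream);
    busy_kernel<<<blocks, threads, 0, stream>>>(d_data, n);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);  // host waits until the stop event has completed

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Times one launch with a host clock. The synchronization after the launch
// is what makes the host-side interval cover the kernel's execution.
double time_with_host_clock(float *d_data, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();  // drain any earlier GPU work first
    auto t0 = std::chrono::steady_clock::now();
    busy_kernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();  // without this, only the launch overhead is timed
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

The host-clock figure includes launch and synchronization overhead, so it will typically be somewhat larger than the event-based one. The first launch in a process also tends to include one-time initialization cost, so a warm-up launch before timing is worth considering.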

score -3 · Accepted Answer

In my own code, I use the clock() function to get precise timings. For convenience, I have the macros

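/* Indices of the individual timers; tid_count is the number of timers. */
enum {
    tid_this = 0,
    tid_that,
    tid_count
};
/* Accumulated clock ticks per timer, summed over all blocks. */
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
 /* Thread 0 of each block records the starting tick. */
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 /* Thread 0 of each block adds its elapsed ticks to timer tid, allowing for counter wrap-around. */
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif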

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
    TIMER_TOC( tid_this );

}

You can then read the contents of cuda_timers in the host code.

A few notes:

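The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
The timers count clock ticks. To get milliseconds, divide the tick count by the device clock rate in kHz (that is, by the number of GHz times 10^6).
The timers slow your code down a bit, which is why I wrapped them in #ifdef USETIMERS so you can switch them off easily.
Although clock() returns an integer value of type clock_t, I store the accumulated values as float, otherwise the values would wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.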
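Reading the timers back on the host could look roughly like this. The helper names `ticks_to_ms` and `read_timers_ms` are hypothetical, and the enum and the `cuda_timers` array are repeated here only so that the sketch compiles on its own; in real code they would be the definitions shown earlier. The clock rate is queried with `cudaDeviceGetAttribute` and `cudaDevAttrClockRate`, which reports the peak frequency in kHz, so the conversion is only approximate if the GPU runs at a different clock.

```cuda
#include <cuda_runtime.h>

// Same layout as the timer definitions in the answer.
enum { tid_this = 0, tid_that, tid_count };
__device__ float cuda_timers[tid_count];

// Converts a tick count to milliseconds: ticks / kHz = ms.
float ticks_to_ms(float ticks, int clock_khz) {
    return ticks / static_cast<float>(clock_khz);
}

// Copies the accumulated ticks to the host and converts them to milliseconds.
// Returns false if any CUDA call fails.
bool read_timers_ms(float ms_out[tid_count]) {
    float ticks[tid_count];
    if (cudaMemcpyFromSymbol(ticks, cuda_timers, sizeof(ticks)) != cudaSuccess)
        return false;

    int device = 0;
    int clock_khz = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    // cudaDevAttrClockRate is the peak clock frequency in kHz.
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, device) != cudaSuccess)
        return false;
    if (clock_khz <= 0) return false;

    for (int t = 0; t < tid_count; ++t)
        ms_out[t] = ticks_to_ms(ticks[t], clock_khz);
    return true;
}
```

Since each timer holds the sum over all blocks, the resulting milliseconds are accumulated block time, not the wall-clock duration of the kernel.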
