cuda - How to interpret the numbers shown in square brackets?

Question

The number shown in square brackets after the kernel name correlates to the CUDA API that launched that kernel. (From GPU-Trace and API-Trace Modes)

In my output, the numbers shown in square brackets after the kernel name are 94, 105, 2191, and 2198.

So what exactly is CUDA API [94] (and the others) in the NVIDIA CUDA Runtime API?

==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
133.81ms  135.78us                    -               -         -         -         -  409.60KB  3.0167GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.62ms  270.66us                    -               -         -         -         -  819.20KB  3.0267GB/s  GeForce GT 640M         1         2  [CUDA memcpy HtoD]
134.90ms  3.7037ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [94]
138.71ms  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [105]
<...more output...>
1.24341s  3.7011ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2191]
1.24711s  3.7046ms            (20 10 1)       (32 32 1)        29  8.1920KB        0B         -           -  GeForce GT 640M         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int) [2198]
1.25089s  248.13us                    -               -         -         -         -  819.20KB  3.3015GB/s  GeForce GT 640M         1         2  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

score 3 · Accepted Answer

It might be clearer if it said:

The number shown in square brackets after the kernel name correlates to the CUDA API call that launched that kernel.

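If you run a given code with the --print-api-trace option, you get a sequential list of all CUDA API calls issued by the application. If you number those calls in order, the number associated with a particular kernel launch is the one shown in square brackets in the --print-gpu-trace output.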

Here is a complete example. Note the correlation between [105], [106], and [108] in the api-trace output and in the gpu-trace output:

$ cat t1.cu
__global__ void k(){}

int main(){

  k<<<1,1>>>();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
   Start  Duration  Name
116.17ms  3.0990us  cuDeviceGetPCIBusId
130.20ms     800ns  cuDeviceGetCount
130.20ms     251ns  cuDeviceGetCount
130.41ms  1.0500us  cuDeviceGet
130.41ms     705ns  cuDeviceGetAttribute
130.42ms     539ns  cuDeviceGetAttribute
130.42ms     547ns  cuDeviceGetAttribute
130.46ms     525ns  cuDeviceGetCount
130.46ms     277ns  cuDeviceGet
130.46ms  59.680us  cuDeviceGetName
130.52ms  63.802us  cuDeviceTotalMem
130.59ms     497ns  cuDeviceGetAttribute
130.59ms     226ns  cuDeviceGetAttribute
130.59ms     282ns  cuDeviceGetAttribute
130.59ms     234ns  cuDeviceGetAttribute
130.59ms     229ns  cuDeviceGetAttribute
130.59ms  34.628us  cuDeviceGetAttribute
130.62ms     372ns  cuDeviceGetAttribute
130.63ms     220ns  cuDeviceGetAttribute
130.63ms     284ns  cuDeviceGetAttribute
130.63ms     237ns  cuDeviceGetAttribute
130.63ms     222ns  cuDeviceGetAttribute
130.63ms     231ns  cuDeviceGetAttribute
130.63ms     288ns  cuDeviceGetAttribute
130.63ms     219ns  cuDeviceGetAttribute
130.63ms  3.1870us  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     275ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     213ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     336ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     213ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     210ns  cuDeviceGetAttribute
130.65ms     215ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms  320.65us  cuDeviceGetAttribute
130.97ms     322ns  cuDeviceGetAttribute
130.97ms     206ns  cuDeviceGetAttribute
130.97ms     218ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.98ms     226ns  cuDeviceGetAttribute
130.98ms     220ns  cuDeviceGetAttribute
130.98ms     212ns  cuDeviceGetAttribute
130.98ms     210ns  cuDeviceGetAttribute
130.98ms     206ns  cuDeviceGetAttribute
130.98ms     207ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms     211ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     229ns  cuDeviceGetAttribute
130.98ms     215ns  cuDeviceGetAttribute
130.98ms     216ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms  316.59us  cuDeviceGetAttribute
131.30ms     266ns  cuDeviceGetAttribute
131.30ms     252ns  cuDeviceGetAttribute
131.30ms     212ns  cuDeviceGetAttribute
131.30ms     235ns  cuDeviceGetAttribute
131.30ms     209ns  cuDeviceGetAttribute
131.30ms     272ns  cuDeviceGetAttribute
131.30ms     207ns  cuDeviceGetAttribute
131.30ms     735ns  cuDeviceGetAttribute
131.30ms     254ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     610ns  cuDeviceGetAttribute
131.31ms     273ns  cuDeviceGetAttribute
131.31ms     412ns  cuDeviceGetAttribute
131.31ms     216ns  cuDeviceGetAttribute
131.31ms     211ns  cuDeviceGetAttribute
131.31ms     205ns  cuDeviceGetAttribute
131.31ms  59.911ms  cudaLaunchKernel (k(void) [105])
191.23ms  11.222us  cudaLaunchKernel (k(void) [106])
191.24ms  5.7860us  cudaDeviceSynchronize
191.25ms  9.2890us  cudaLaunchKernel (k(void) [108])
191.26ms  5.1790us  cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
191.20ms  1.6000us              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [105]
191.22ms     896ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [106]
191.23ms     928ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [108]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
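
The numbering described above can be checked mechanically. The following Python sketch is not part of nvprof. It assumes that every API call appears as exactly one row after the "Start  Duration  Name" header, as in the output above. The function names number_api_calls and kernel_launches are made up for this illustration, and the sample text is a synthetic stand-in: it replaces the 104 device-query rows with one repeated row and keeps the launch and synchronize rows from the trace above.

```python
import re


def number_api_calls(trace_text):
    """Return (position, name) pairs for the rows of an api-trace table.

    Assumption: the table starts right after the 'Start Duration Name'
    header and each following row is one API call, numbered from 1.
    """
    calls = []
    in_table = False
    for line in trace_text.splitlines():
        fields = line.strip().split(None, 2)
        if not in_table:
            # Skip everything up to and including the header row.
            in_table = fields == ["Start", "Duration", "Name"]
            continue
        if len(fields) < 3:
            continue  # blank or malformed row
        calls.append((len(calls) + 1, fields[2]))
    return calls


def kernel_launches(calls):
    """Yield (position, bracket_id, name) for rows ending in a [N] tag."""
    for position, name in calls:
        match = re.search(r"\[(\d+)\]\)?$", name)
        if match:
            yield position, int(match.group(1)), name


# Synthetic stand-in for the api-trace above: 104 device-query rows,
# followed by the launch and synchronize rows copied from that output.
sample = "\n".join(
    ["==7206== Profiling result:", "   Start  Duration  Name"]
    + ["130.63ms     211ns  cuDeviceGetAttribute"] * 104
    + [
        "131.31ms  59.911ms  cudaLaunchKernel (k(void) [105])",
        "191.23ms  11.222us  cudaLaunchKernel (k(void) [106])",
        "191.24ms  5.7860us  cudaDeviceSynchronize",
        "191.25ms  9.2890us  cudaLaunchKernel (k(void) [108])",
        "191.26ms  5.1790us  cudaDeviceSynchronize",
    ]
)

for position, bracket_id, name in kernel_launches(number_api_calls(sample)):
    print(position, bracket_id, name)
```

For this sample, each launch's row position equals its bracketed number (105, 106, 108). Position 107 is the cudaDeviceSynchronize call between the second and third launches, which is why no kernel carries [107].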
