cuda - How can I collect the event value for every invocation of a CUDA kernel function with nvprof?

Question

I am profiling a CUDA program with nvprof.

I am posting this question again.

With nvprof --events tex0_cache_sector_queries --replay-mode kernel ./matrixMul,

or nvprof --events tex0_cache_sector_queries --replay-mode application ./matrixMul,

we can collect the event value results:

==40013== Profiling application: ./matrixMul
==40013== Profiling result:
==40013== Event result:
"Device","Kernel","Invocations","Event Name","Min","Max","Avg","Total"
"Tesla K80 (0)","void matrixMulCUDA<int=32>(float*, float*, float*, int, int)",301,"tex0_cache_sector_queries",0,30,24,7224

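The result above is a summary. It covers the tex0_cache_sector_queries values over the 301 invocations of the kernel function matrixMulCUDA, but it reports only the min, max, avg, and total across those 301 invocations, i.e. an aggregated result.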
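To make the limitation concrete, here is a minimal Python sketch that parses the two CSV lines shown above (copied verbatim from the nvprof output in this question) and checks how the aggregate columns relate to each other. The helper name parse_summary is only illustrative. The row holds a single set of aggregates, so the 301 individual values cannot be recovered from it.

```python
import csv
import io

# Summary output copied verbatim from the question: header row plus one data row.
SUMMARY = '''"Device","Kernel","Invocations","Event Name","Min","Max","Avg","Total"
"Tesla K80 (0)","void matrixMulCUDA<int=32>(float*, float*, float*, int, int)",301,"tex0_cache_sector_queries",0,30,24,7224
'''


def parse_summary(text):
    """Return the summary rows as a list of dicts keyed by the CSV header."""
    return list(csv.DictReader(io.StringIO(text)))


rows = parse_summary(SUMMARY)
row = rows[0]
invocations = int(row["Invocations"])
avg = int(row["Avg"])
total = int(row["Total"])

# One row per (kernel, event) pair: only Min/Max/Avg/Total are available.
print(row["Event Name"], invocations, avg, total)

# For this particular row the total equals avg * invocations (24 * 301).
print(avg * invocations == total)
```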

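I want to collect all 301 individual tex0_cache_sector_queries values, one per invocation of matrixMulCUDA. In other words, every time the kernel function matrixMulCUDA is called, I want to record the tex0_cache_sector_queries event value for that call. How can I do that?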

score 1 · Accepted Answer

1 Run:

nvprof --pc-sampling-period 31 --print-gpu-trace --replay-mode application \
--export-profile application.prof --events tex0_cache_sector_queries ./matrixMul

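2 Import application.prof into the Visual Profiler:

Visual Profiler result (screenshot)

3 Follow the index shown in the screenshot to read the event value for each invocation of each kernel function.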

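4 The --print-gpu-trace option: prints individual kernel invocations (including CUDA memcpy's/memset's) and sorts them in chronological order. In event/metric profiling mode, it shows the events/metrics for each kernel invocation, which solves this problem. See "Print GPU trace".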
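If the per-invocation trace is written as text instead of being opened in the Visual Profiler (for example by adding nvprof's --csv and --log-file options to the command in step 1), the values could be pulled out with a short script. The following is only a sketch under assumptions: it assumes the trace has a header row containing a "Kernel" column and a column whose name starts with the event name, and that status lines begin with "==". The function name per_invocation_values and the SAMPLE lines are hypothetical illustrations, not real profiler output, so check the column layout against your own log.

```python
import csv


def per_invocation_values(lines, event, kernel_substr):
    """Collect one event value per kernel invocation from GPU-trace CSV lines.

    Assumed layout (not verified against a real log): a header row with a
    "Kernel" column and a column whose name starts with `event`.
    """
    # Drop blank lines and nvprof status lines such as "==PID== Profiling result:".
    data = [ln for ln in lines if ln.strip() and not ln.startswith("==")]

    kernel_col = event_col = None
    values = []
    for row in csv.reader(data):
        if kernel_col is None:
            # Keep scanning until the header row is found.
            if "Kernel" in row:
                matches = [i for i, name in enumerate(row) if name.startswith(event)]
                if not matches:
                    raise ValueError("event column not found: " + event)
                kernel_col, event_col = row.index("Kernel"), matches[0]
            continue
        if len(row) <= max(kernel_col, event_col):
            continue  # short or malformed row
        # Keep rows for the requested kernel that carry an integer event value.
        if kernel_substr in row[kernel_col] and row[event_col].isdigit():
            values.append(int(row[event_col]))
    return values


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    '==12345== Profiling result:',
    '"Device","Context","Stream","Kernel","tex0_cache_sector_queries"',
    '"Tesla K80 (0)","1","7","void matrixMulCUDA<int=32>(float*, float*, float*, int, int)","24"',
    '"Tesla K80 (0)","1","7","void matrixMulCUDA<int=32>(float*, float*, float*, int, int)","30"',
]

values = per_invocation_values(SAMPLE, "tex0_cache_sector_queries", "matrixMulCUDA")
print(values)
```

With a real log, the lines would come from open(path).read().splitlines(), and the returned list should have one entry per kernel invocation (301 for the run in the question).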
