Can nvprof be used to count the number of CUDA kernel executions (i.e., how many kernels were launched)?
Right now, when I run nvprof, this is what I see:
==537== Profiling application: python tf.py
==537== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 51.73%  91.294us        20  4.5640us  4.1280us  6.1760us  [CUDA memcpy HtoD]
 43.72%  77.148us        20  3.8570us  3.5840us  4.7030us  [CUDA memcpy DtoH]
  4.55%  8.0320us         1  8.0320us  8.0320us  8.0320us  [CUDA memset]
==537== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 90.17%  110.11ms         1  110.11ms  110.11ms  110.11ms  cuDevicePrimaryCtxRetain
  6.63%  8.0905ms         1  8.0905ms  8.0905ms  8.0905ms  cuMemAlloc
  0.57%  700.41us         2  350.21us  346.89us  353.52us  cuMemGetInfo
  0.55%  670.28us         1  670.28us  670.28us  670.28us  cuMemHostAlloc
  0.28%  347.01us         1  347.01us  347.01us  347.01us  cuDeviceTotalMem
...
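
To make concrete what I mean by "count", one possible approach is to sum the Calls column over the kernel rows of the "Profiling result" table. The following is only a rough sketch: `count_kernel_launches` is a hypothetical helper (not part of nvprof), it assumes the summary layout shown above (newer nvprof versions may print a different layout), and it assumes that bracketed entries such as `[CUDA memcpy HtoD]` are memory operations rather than kernels.

```python
def count_kernel_launches(nvprof_text):
    """Sum the Calls column over kernel rows of the 'Profiling result' table.

    Hypothetical helper; assumes the summary layout pasted in the question.
    """
    in_gpu_section = False
    total = 0
    for line in nvprof_text.splitlines():
        stripped = line.strip()
        # Section headers look like "==537== Profiling result:".
        if stripped.startswith("=="):
            in_gpu_section = stripped.endswith("Profiling result:")
            continue
        if not in_gpu_section:
            continue  # skip the "API calls" table and anything else
        # Columns: Time(%) Time Calls Avg Min Max Name (Name may contain spaces).
        fields = stripped.split(None, 6)
        if len(fields) < 7 or not fields[2].isdigit():
            continue  # column header row, "..." elision, blank lines
        if fields[6].startswith("[CUDA "):
            continue  # assumed: bracketed rows are memcpy/memset, not kernels
        total += int(fields[2])
    return total


if __name__ == "__main__":
    # The GPU section of the output pasted above.
    output = """\
==537== Profiling application: python tf.py
==537== Profiling result:
Time(%) Time Calls Avg Min Max Name
51.73% 91.294us 20 4.5640us 4.1280us 6.1760us [CUDA memcpy HtoD]
43.72% 77.148us 20 3.8570us 3.5840us 4.7030us [CUDA memcpy DtoH]
4.55% 8.0320us 1 8.0320us 8.0320us 8.0320us [CUDA memset]
==537== API calls:
Time(%) Time Calls Avg Min Max Name
90.17% 110.11ms 1 110.11ms 110.11ms 110.11ms cuDevicePrimaryCtxRetain
"""
    print(count_kernel_launches(output))  # prints 0
```

On the output above this gives 0, because the "Profiling result" table lists only memcpy and memset rows and no kernel names.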