In nvprof I can see the stream IDs (0, 13, 15, etc.) of each CUDA execution stream I am using.

Given a stream variable, I would like to be able to print out the stream ID. Currently I cannot find any API to do this, and casting the cudaStream_t to an int or uint does not yield a reasonable ID. sizeof() says cudaStream_t is 8 bytes.
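For illustration, a minimal sketch of the kind of cast referred to above (hypothetical code; the printed value is just the handle's bits, not the ID that nvprof shows):

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

int main(){
  cudaStream_t s;
  cudaStreamCreate(&s);
  // The handle is 8 bytes, but reinterpreting it as an integer just yields
  // the pointer/handle value, not the stream ID reported by the profiler.
  printf("sizeof(cudaStream_t) = %zu\n", sizeof(cudaStream_t));
  printf("handle bits          = %llu\n", (unsigned long long)(uintptr_t)s);
  cudaStreamDestroy(s);
  return 0;
}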
In short: I am not aware of a way to access those IDs directly, but you can give streams explicit names for profiling purposes.

cudaStream_t is an opaque "resource handle" type. A resource handle is something like a pointer; so it stands to reason that the stream ID is not contained in the pointer (handle) itself, but somehow in what it refers to.

Since it is opaque (CUDA provides no definition of what it points to) and, as you point out, there is no direct API for this, I don't think you will find a way to extract the stream ID from a cudaStream_t at runtime.

For the assertion that cudaStream_t is a resource handle and is opaque, refer to the CUDA header file driver_types.h.
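Paraphrasing the declaration in driver_types.h (the exact attribute macros vary by CUDA version, but the handle is just a pointer to a forward-declared struct whose layout is never exposed):

/* from driver_types.h, decoration simplified */
typedef struct CUstream_st *cudaStream_t;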
However, the NV Tools Extension API gives you the ability to "name" a particular stream (or other resource). This lets you associate a particular stream in your source code with a particular name in the profiler.
Here is a fully worked example:
$ cat t138.cu
#include <stdio.h>
#include <nvToolsExtCudaRt.h>
const long long tdel = 1000000000LL;   // number of clock ticks to spin
__global__ void tkernel(){
  // busy-wait long enough for the kernel to show up clearly in the profiler
  long long st = clock64();
  while (clock64() < st + tdel);
}
int main(){
  cudaStream_t s1, s2, s3, s4;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaStreamCreate(&s3);
  cudaStreamCreate(&s4);
#ifdef USE_S_NAMES
  // give each stream a human-readable name that the profiler will display
  nvtxNameCudaStreamA(s1, "stream 1");
  nvtxNameCudaStreamA(s2, "stream 2");
  nvtxNameCudaStreamA(s3, "stream 3");
  nvtxNameCudaStreamA(s4, "stream 4");
#endif
  tkernel<<<1,1,0,s1>>>();
  tkernel<<<1,1,0,s2>>>();
  tkernel<<<1,1,0,s3>>>();
  tkernel<<<1,1,0,s4>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt
$ nvprof --print-gpu-trace ./t138
==28720== NVPROF is profiling process 28720, command: ./t138
==28720== Profiling application: ./t138
==28720== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
464.80ms 622.06ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 13 tkernel(void) [393]
464.81ms 621.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 14 tkernel(void) [395]
464.82ms 623.30ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 15 tkernel(void) [397]
464.82ms 622.69ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 16 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$ nvcc -arch=sm_61 -o t138 t138.cu -lnvToolsExt -DUSE_S_NAMES
$ nvprof --print-gpu-trace ./t138
==28799== NVPROF is profiling process 28799, command: ./t138
==28799== Profiling application: ./t138
==28799== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
457.98ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 1 tkernel(void) [393]
457.99ms 544.31ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 2 tkernel(void) [395]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 3 tkernel(void) [397]
458.00ms 544.07ms (1 1 1) (1 1 1) 8 0B 0B - - TITAN X (Pascal 1 stream 4 tkernel(void) [399]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$