
I'm trying to figure out what exactly each of the metrics reported by "nvprof" measures. More specifically, I can't figure out which transactions count as System Memory and Device Memory reads and writes. I wrote a very basic code just to help figure this out.

#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cmath>   // for ceil() in main
#include <cstdio>
#include <iostream>

__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
  __shared__ TYPE s_data[BDIMY][BDIMX];
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  int iy = blockIdx.y * blockDim.y + threadIdx.y;
  int in_idx = iy * dimx + ix; // index for reading input
  int tx = threadIdx.x; // thread's x-index into the corresponding shared memory tile
  int ty = threadIdx.y; // thread's y-index into the corresponding shared memory tile
  s_data[ty][tx] = g_input[in_idx]; // global load -> shared store
  __syncthreads();
  g_output[in_idx] = s_data[ty][tx] * 1.3f; // shared load -> global store
}


int main(){
  int size_x = 16, size_y = 16;
  dim3 numTB;
  numTB.x = (int)ceil((double)size_x / (double)BDIMX);
  numTB.y = (int)ceil((double)size_y / (double)BDIMY);
  dim3 tbSize;
  tbSize.x = BDIMX;
  tbSize.y = BDIMY;
  TYPE *a, *a_out;
  TYPE *a_host = (TYPE *) malloc(size_x * size_y * sizeof(TYPE)); // host staging buffer
  cudaMalloc((void**)&a,     size_x * size_y * sizeof(TYPE));
  cudaMalloc((void**)&a_out, size_x * size_y * sizeof(TYPE));
  for(int index = 0; index < size_x * size_y; index++){
    a_host[index] = index;
  }
  cudaMemcpy(a, a_host, size_x * size_y * sizeof(TYPE), cudaMemcpyHostToDevice);
  kernel <<<numTB, tbSize>>>(a_out, a, size_x, size_y);
  cudaDeviceSynchronize();
  cudaFree(a);
  cudaFree(a_out);
  free(a_host);
  return 0;
}
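
Then I run nvprof with --metrics all to see all of the metrics. Roughly, the build and profile steps look like this (the file/binary name memtest is arbitrary; -arch=sm_35 matches the Tesla K40c):

nvcc -arch=sm_35 -o memtest memtest.cu
nvprof --metrics all ./memtest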

This is the part of the output I'm interested in:

           Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
  Kernel: kernel(float*, float*, int, int)
    local_load_transactions                   Local Load Transactions           0           0           0
   local_store_transactions                  Local Store Transactions           0           0           0
   shared_load_transactions                  Shared Load Transactions           8           8           8
  shared_store_transactions                 Shared Store Transactions           8           8           8
           gld_transactions                  Global Load Transactions           8           8           8
           gst_transactions                 Global Store Transactions           8           8           8
   sysmem_read_transactions           System Memory Read Transactions           0           0           0
  sysmem_write_transactions          System Memory Write Transactions           4           4           4
     tex_cache_transactions                Texture Cache Transactions           0           0           0
     dram_read_transactions           Device Memory Read Transactions           0           0           0
    dram_write_transactions          Device Memory Write Transactions          40          40          40
       l2_read_transactions                      L2 Read Transactions          70          70          70
      l2_write_transactions                     L2 Write Transactions          46          46          46

I understand the shared and global accesses. The global accesses are coalesced, and since there are 8 warps, there are 8 transactions. But I can't figure out where the system memory and device memory write transaction numbers come from.
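
For reference, the arithmetic behind my reading of the global numbers (assuming one 128-byte coalesced transaction per warp of 32 threads):

threads per block : 16 * 16 = 256
warps             : 256 / 32 = 8
bytes per warp    : 32 threads * 4 bytes (float) = 128, i.e. one coalesced transaction
so gld_transactions = gst_transactions = 8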

1 Answer

It would help if you had a model of the GPU memory hierarchy that includes both the logical and physical spaces, such as the one here.

Referring to the "overview tab" diagram:

  1. gld_transactions refer to transactions issued from the warp targeting the global logical space. On the diagram, this would be the line from the "Kernel" box on the left to the "global" box to its right, and the logical data-movement direction is from right to left.

  2. gst_transactions refer to the same line as above, but with the logical direction from left to right. Note that these logical global transactions could hit in a cache and not go anywhere after that. From the metrics standpoint, these transaction types only refer to the indicated line on the diagram.

  3. dram_write_transactions refer to the line on the right of the diagram that connects device memory to the L2 cache, and the logical data flow on this line is from left to right. Since the L2 cache line is 32 bytes (whereas the L1 cache line and the size of a global transaction are 128 bytes), device memory transactions are also 32 bytes, not 128. So a global write transaction that passes through the L1 (it is a write-through cache, if enabled) and the L2 will generate 4 dram_write transactions. This should explain 32 of the 40 transactions; see the arithmetic sketch after this list.

  4. System memory transactions target zero-copy host memory. You don't appear to have any, so I can't explain those; a minimal sketch of what such an allocation looks like follows this list.
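
To make item 3 concrete, this is the arithmetic sketch behind "32 of the 40" (my reasoning, not profiler output):

8 coalesced global store transactions * 128 bytes = 1024 bytes
1024 bytes / 32 bytes per L2-to-DRAM transaction  = 32 dram_write_transactions

For item 4, here is a minimal sketch of a program that would generate system memory transactions, assuming a device that supports mapped host memory (the kernel name touch and the buffer size are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores through a device pointer
// that aliases host memory.
__global__ void touch(float *p) { p[threadIdx.x] = 1.0f; }

int main(){
  cudaSetDeviceFlags(cudaDeviceMapHost); // opt in to mapped (zero-copy) host allocations
  float *h_buf, *d_alias;
  // Pinned host memory that the GPU can address directly over the bus.
  cudaHostAlloc((void**)&h_buf, 32 * sizeof(float), cudaHostAllocMapped);
  // Device-side pointer aliasing the same host allocation.
  cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);
  // These 32 stores target host memory, so they should show up as
  // sysmem_write_transactions rather than dram_write_transactions.
  touch<<<1, 32>>>(d_alias);
  cudaDeviceSynchronize();
  printf("%f\n", h_buf[0]);
  cudaFreeHost(h_buf);
  return 0;
}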

Note that in some cases, for some metrics, on some GPUs, the profiler may have some "inaccuracy" when you launch a very small number of threadblocks. For example, some metrics are sampled on a per-SM basis and scaled. (However, device memory transactions are not in this category.) If disparate work is being done on each SM (perhaps due to a very small number of threadblocks launched), the scaling can be misleading/less accurate. Generally, these effects become insignificant if you launch a larger number of threadblocks.

This answer may also be of interest.

answered 2016-04-20T23:30:08.393