parallel-processing - CUDA tiled matrix multiplication explanation

Question

template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd   = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep  = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep  = BLOCK_SIZE * wB;

....
....

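This part of the kernel is quite tricky for me. I know that the matrices A and B are represented as arrays (float *), and I also understand the idea of using shared memory to compute the dot products tile by tile.

My problem is that I don't understand the beginning of the code, in particular three specific variables (aBegin, aEnd and bBegin). Could someone draw an example of a possible execution to help me understand how the indexing works in this specific case? Thank you.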

score 3 · Accepted Answer

Here is a drawing to help understand the values assigned to the first variables of the CUDA kernel and the overall computation performed:

The matrices are stored in row-major order. The CUDA code assumes that the matrix dimensions are divisible by BLOCK_SIZE.
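To make the row-major assumption concrete, here is a minimal host-side sketch in plain C++. The helper names rowMajorIndex and dividesEvenly are hypothetical and do not appear in the kernel; they only illustrate the two assumptions stated above.

```cpp
#include <cassert>

// Hypothetical helper (not part of the kernel): linear index of element
// (row, col) in a row-major matrix that has `width` columns.
inline int rowMajorIndex(int row, int col, int width) {
    assert(row >= 0 && col >= 0 && col < width);
    return row * width + col;
}

// Hypothetical helper: true when a dimension splits into whole tiles of
// side `blockSize`, which is what the kernel assumes for its matrices.
inline bool dividesEvenly(int dim, int blockSize) {
    return blockSize > 0 && dim % blockSize == 0;
}
```

For example, in a matrix with 8 columns, element (2, 3) sits at index 2 * 8 + 3 = 19.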

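The matrices A, B and C are virtually split into blocks according to the CUDA kernel grid. All blocks of C can be computed in parallel. For a given dark-grey block of C, the main loop walks through several light-grey blocks of A and B in lockstep. Each block is computed in parallel using BLOCK_SIZE * BLOCK_SIZE threads.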

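bx and by are the block-based position of the current block within the CUDA grid. tx and ty are the cell-based position, within the current block, of the cell computed by the current thread.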
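One way to read this is as a mapping from (bx, by, tx, ty) to a global cell of C. The sketch below is plain host-side C++ and rests on an assumption: that each thread produces the single C element at the position described above. The excerpt elides the part of the kernel that writes C, so the Cell struct and cellOfThread are hypothetical names for illustration only.

```cpp
#include <cassert>

// Hypothetical type: a global (row, col) position in C.
struct Cell {
    int row;
    int col;
};

// Assumed mapping: block (bx, by) covers a blockSize x blockSize tile of C,
// and thread (tx, ty) handles one cell inside that tile.
inline Cell cellOfThread(int bx, int by, int tx, int ty, int blockSize) {
    assert(tx >= 0 && tx < blockSize);
    assert(ty >= 0 && ty < blockSize);
    return Cell{by * blockSize + ty, bx * blockSize + tx};
}
```

With blockSize = 4, thread (tx = 3, ty = 0) of block (bx = 1, by = 2) would map to row 8, column 7 of C.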

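Here is a detailed analysis of the aBegin variable: aBegin is the memory location of the first cell of the first computed block of matrix A. It is set to wA * BLOCK_SIZE * by because each block contains BLOCK_SIZE * BLOCK_SIZE cells, there are wA / BLOCK_SIZE blocks in each horizontal row of blocks, and there are by rows of blocks above the current computed block of A. Thus, (BLOCK_SIZE * BLOCK_SIZE) * (wA / BLOCK_SIZE) * by = BLOCK_SIZE * wA * by.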
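The arithmetic can be checked on the host with a small C++ sketch. It reuses the two expressions from the kernel verbatim; the ARange struct and the function name aRangeForBlockRow are hypothetical. It also shows aEnd, which by the same row-major reasoning is the index of the last element in the first row of the block row (aBegin plus wA - 1).

```cpp
#include <cassert>

// Hypothetical type holding the two indices computed by the kernel for A.
struct ARange {
    int begin;  // aBegin in the kernel
    int end;    // aEnd in the kernel
};

inline ARange aRangeForBlockRow(int wA, int by, int blockSize) {
    // The kernel assumes wA is a multiple of the tile size.
    assert(blockSize > 0 && wA % blockSize == 0);
    // Skip `by` rows of blocks; each one spans blockSize matrix rows of wA elements.
    int aBegin = wA * blockSize * by;
    // Last element of the first matrix row inside this row of blocks.
    int aEnd = aBegin + wA - 1;
    return ARange{aBegin, aEnd};
}
```

For wA = 8, BLOCK_SIZE = 4 and by = 1, this gives aBegin = 32 (row 4, column 0 of A) and aEnd = 39 (row 4, column 7).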

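The same logic applies to bBegin: it is set to BLOCK_SIZE * bx because, in the first row of B, bx blocks of width BLOCK_SIZE lie in memory before the first cell of the first computed block of matrix B.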

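a is incremented by aStep = BLOCK_SIZE in the loop, so that the next computed block is the one immediately to the right (on the drawing) of the current computed block of A. b is incremented by bStep = BLOCK_SIZE * wB in the same loop, so that the next computed block is the one immediately below (on the drawing) the current computed block of B.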
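The lockstep walk over A and B can be traced on the host with the following C++ sketch. The five index expressions are copied from the kernel; the loop condition a <= aEnd is an assumption suggested by the variable names, since the excerpt elides the loop itself. The function name tileStarts is hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (a, b) start indices of each tile pair visited for block (bx, by).
// Assumption: the elided kernel loop runs while a <= aEnd.
inline std::vector<std::pair<int, int>> tileStarts(int wA, int wB, int bx, int by,
                                                   int blockSize) {
    assert(blockSize > 0 && wA % blockSize == 0 && wB % blockSize == 0);
    const int aBegin = wA * blockSize * by;
    const int aEnd = aBegin + wA - 1;
    const int aStep = blockSize;       // move one tile to the right in A
    const int bBegin = blockSize * bx;
    const int bStep = blockSize * wB;  // move one tile down in B

    std::vector<std::pair<int, int>> starts;
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        starts.emplace_back(a, b);
    }
    return starts;
}
```

For wA = 8, wB = 12, BLOCK_SIZE = 4 and block (bx = 2, by = 1), the walk visits two tile pairs: (a, b) = (32, 8) and then (36, 56). In row-major terms these are A cells (4, 0) and (4, 4), and B cells (0, 8) and (4, 8).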
