I am trying to understand how this sample code from the CUDA SDK 8.0 works:
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    ....
    ....
This part of the kernel is quite tricky for me. I know that the matrices A and B are passed as flat arrays (float *), and I also understand the idea of using shared-memory tiles to compute the dot products.
My problem is that I do not understand the beginning of the code, specifically the three variables aBegin, aEnd, and bBegin. Could someone sketch an example of a possible execution to help me see how the indexing works in this particular case? Thanks.