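I have the following matrix multiplication code, implemented with CUDA 3.2 and VS 2008. I am running on Windows Server 2008 R2 Enterprise with an Nvidia GTX 480. The code below works for values of "Width" (the matrix width) up to about 2500.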
int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;
//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);
//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
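As a side note on the launch configuration above, the grid is rounded up to whole tiles, so it can contain more threads than the matrix has elements. The following is a minimal host-side sketch of that arithmetic. The post does not state the value of TileWidth, so 16 is only an assumed value in the discussion below, and the helper names are hypothetical.

```cpp
// Number of blocks per grid dimension: integer ceiling of width / tileWidth.
// Matches ceil((float)Width / TileWidth) for the sizes discussed in this post.
int blocksPerDim(int width, int tileWidth) {
    return (width + tileWidth - 1) / tileWidth;
}

// Threads launched along one dimension that fall beyond the matrix edge.
int extraThreadsPerDim(int width, int tileWidth) {
    return blocksPerDim(width, tileWidth) * tileWidth - width;
}

// Total launched threads whose (Row, Column) lies outside the width x width matrix.
long long threadsOutsideMatrix(int width, int tileWidth) {
    long long launchedPerDim = (long long)blocksPerDim(width, tileWidth) * tileWidth;
    return launchedPerDim * launchedPerDim - (long long)width * width;
}
```

With Width = 3000 and an assumed TileWidth of 16, blocksPerDim returns 188, so 3008 threads are launched per dimension and 8 of them per dimension lie past the edge. Neither kernel in this post checks Row < Width or Column < Width, so such threads index past the end of a row or beyond the allocations. Whether that contributes to the error described here is not established.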
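When I set "Width" to 3000 or greater, the screen goes black and then I get the following error:
I searched online and saw that some people hit this problem because the watchdog kills the kernel once it has hung for more than 5 seconds. I tried editing "TdrDelay" in the registry, which only lengthened the time before the black screen and the same error appeared. So I concluded that this was not my problem.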
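To put the watchdog reasoning in numbers: the naive kernel performs Width multiply-adds for each of the Width x Width output elements, so the work grows with the cube of Width. The sketch below extrapolates a kernel time from one width to another under the assumption that run time is proportional to the operation count. The post contains no timings, so any input time is hypothetical, and the function names are illustrative.

```cpp
// Multiply-add operations performed by the naive kernel: each of the
// width*width output elements runs a loop of `width` iterations.
double multiplyAdds(int width) {
    double w = width;
    return w * w * w;
}

// Estimated kernel time at newWidth, extrapolated from a time measured at
// oldWidth, ASSUMING run time scales linearly with the operation count.
double scaledSeconds(double oldSeconds, int oldWidth, int newWidth) {
    return oldSeconds * multiplyAdds(newWidth) / multiplyAdds(oldWidth);
}
```

Going from Width = 2500 to Width = 3000 multiplies the operation count by about 1.73, so a kernel that finishes comfortably inside a time limit at 2500 can exceed the same limit at 3000.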
I debugged my code and found that this line is the culprit:
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
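This is the call I use to copy the result back from the device after launching my matrix multiplication kernel. Everything up to that point seems to run fine. I believe I am allocating memory correctly, and I cannot figure out why this is happening. I thought that maybe my card does not have enough memory for this, but in that case shouldn't cudaMalloc return an error? (I confirmed while debugging that it does not.)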
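The memory question can be checked with simple arithmetic. The sketch below computes the device memory needed for the three matrices, using size_t rather than the int used in the host code so the product cannot overflow at very large widths. The helper names are hypothetical, and the figures assume a 4-byte float.

```cpp
#include <cstddef>

// Bytes for one width x width float matrix, computed in size_t so the
// product does not overflow a 32-bit int for large widths.
std::size_t matrixBytes(std::size_t width) {
    return width * width * sizeof(float);
}

// Device memory for the three matrices Md, Nd and Pd together.
std::size_t totalDeviceBytes(std::size_t width) {
    return 3 * matrixBytes(width);
}
```

For Width = 3000 this gives 36,000,000 bytes per matrix and 108,000,000 bytes in total, which is far below the 1536 MB a GTX 480 is specified with. That is consistent with cudaMalloc succeeding. Note also that kernel launches are asynchronous, so an error raised while the kernel runs can be reported by a later call such as cudaMemcpy rather than at the launch itself.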
Any ideas or help would be greatly appreciated! Thanks a lot, everyone!
Kernel code:
//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;
    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + threadIdx.y;
    int Column = (TileWidth*blockIdx.x) + threadIdx.x;
    //Pvalue stores the Pd element that is computed by this thread
    float Pvalue = 0;
    for (int i = 0; i < Width; ++i)
    {
        float Mdelement = Md[Row * Width + i];
        float Ndelement = Nd[i * Width + Column];
        Pvalue += Mdelement * Ndelement;
    }
    //Write the result to device memory; each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
I also have this other function, which uses shared memory, and it gives the same error:
Call:
MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);
Kernel code:
//Matrix Multiplication Kernel - Shared Memory Implementation
__global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;
    //Initialize shared memory
    extern __shared__ float sharedArrays[];
    float* Mds = (float*) &sharedArrays;
    float* Nds = (float*) &Mds[TileWidth*TileWidth];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + ty;
    int Column = (TileWidth*blockIdx.x) + tx;
    float Pvalue = 0;
    //For each tile, load the elements into shared memory
    for( int i = 0; i < ceil((float)Width/TileWidth); ++i)
    {
        Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
        Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column];
        __syncthreads();
        for( int j = 0; j < TileWidth; ++j)
        {
            Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
        }
        __syncthreads();
    }
    //Write the result to device memory; each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
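For the shared-memory version, the launch size and the tile loop can be checked on the host in the same way. The sketch below computes the dynamic shared memory requested by the call above and the highest column index of Md that the load of the last tile touches. TileWidth is not given in the post, so 16 is again only an assumed value in the discussion, and the helper names are hypothetical.

```cpp
#include <cstddef>

// Dynamic shared memory requested by the launch:
// two TileWidth x TileWidth float tiles (Mds and Nds).
std::size_t sharedBytes(std::size_t tileWidth) {
    return sizeof(float) * tileWidth * tileWidth * 2;
}

// Number of iterations of the kernel's outer tile loop (integer ceiling).
int tileCount(int width, int tileWidth) {
    return (width + tileWidth - 1) / tileWidth;
}

// Largest value of i*TileWidth + tx reached in the last tile,
// i.e. with i = tileCount - 1 and tx = TileWidth - 1.
int lastTileMaxIndex(int width, int tileWidth) {
    return tileCount(width, tileWidth) * tileWidth - 1;
}
```

With an assumed TileWidth of 16, the launch requests 2048 bytes of shared memory per block, which is small. For Width = 3000 the loop runs 188 times and the last tile reads column index 3007, which is past the last valid index 2999, because the loads are not guarded against the matrix edge.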