cuda - CUDA GPU slower than CPU

Question

I can't figure out why my CUDA code runs slower than my CPU code.

My desktop configuration is an i7 2600S and a GeForce 560 Ti.

My code is as follows:

int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
// time measurement
float elapsed_time_ms = 0;
cudaEvent_t start, stop;

// threads per block
dim3 threads(16,16);
// blocks
dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);

int *device_Seam;

int *host_Seam;

int seamSize;
if(direction == 1)
{
    seamSize = height*sizeof(int);
    host_Seam = (int*)malloc(seamSize);
    for(int i=0;i<height;i++)
        host_Seam[i] = seam[i];
}
else
{
    seamSize = width*sizeof(int);
    host_Seam = (int*)malloc(seamSize);
    for(int i=0;i<width;i++)
        host_Seam[i] = seam[i];
}

cudaMalloc((void**)&device_Seam, seamSize);
cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);

global_host_MC = MCEnergyMat;
new_host_MC = newE;

// copy host arrays to device: first the table of row pointers, then each row
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
for(int i=0;i<width;i++)
    cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);

cudaMemcpy(new_MC, new_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
for(int i=0;i<width;i++)
    cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);


cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

//do some operations on the 2d matrix
gpu_shiftSeam<<< blocks,threads >>>(global_MC, new_MC, device_Seam, width, height);

// measure end time for the GPU kernel (the copies above and below are not timed)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop );

execTime += elapsed_time_ms;

//copy out the data back to host (RESULT)
for(int i=0;i<width;i++)
{
    cudaMemcpy(newE[i], new_MC2[i], sizeof(int)*height, cudaMemcpyDeviceToHost);
}

return newE;
}

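I looped this 800 times and got the following results:

GPU compute time (the gpu_shiftSeam part): 1176 ms; total program run time: 22 s

CPU compute time (same operation as gpu_shiftSeam, but on the host): 12522 ms; total program run time: 12 s

Clearly the GPU compute time is much shorter than the CPU's, but for some reason the total program run time with the GPU is much longer. Does anyone know why? Is it because I am assigning an incorrect number of threads/blocks? Or does it come from the "slowness" of allocating memory on the device?

Thanks a lot!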

score 2 · Accepted Answer

In my experience, memory accesses are the #1 reason for slowness.

Profile your array copies to see how much time is being spent. If it is a considerable amount, try optimizing that part of your code. Instead of copying inside a for loop, see if you can copy sizeof(int) * height * width bytes directly in one call, which requires the matrix to be stored contiguously. Each cudaMemcpy call carries a fixed overhead, so reducing the number of calls should help.

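// Assumes the rows are allocated back to back, so that row 0 points at one contiguous
// width*height block on both the host (global_host_MC[0]) and the device (global_MC2[0]).
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
cudaMemcpy(global_MC2[0], global_host_MC[0], sizeof(int)*height*width, cudaMemcpyHostToDevice);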
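As a rough illustration of the single-copy idea, here is a minimal sketch that stores the matrix as one flat buffer, so each direction needs exactly one cudaMemcpy. The names addOne and roundTripFlat are hypothetical and not from the question, and the layout (element (x, y) at index x * height + y) is an assumption chosen to mirror the question's width-by-height indexing. The kernel is only a stand-in for gpu_shiftSeam.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Stand-in kernel (hypothetical): touches every element so the round trip is observable.
// Element (x, y) lives at index x * height + y in the flat buffer.
__global__ void addOne(int *mat, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        mat[x * height + y] += 1;
}

// Uploads a width*height matrix with ONE cudaMemcpy, runs the kernel,
// and downloads the result with ONE cudaMemcpy. Returns false on any CUDA error.
bool roundTripFlat(std::vector<int> &hostMat, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(hostMat.size() == (size_t)width * (size_t)height);
    size_t bytes = sizeof(int) * hostMat.size();

    int *devMat = nullptr;
    if (cudaMalloc((void**)&devMat, bytes) != cudaSuccess)
        return false;

    // One host-to-device copy instead of `width` separate row copies.
    bool ok = cudaMemcpy(devMat, hostMat.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 threads(16, 16);
        dim3 blocks((width + threads.x - 1) / threads.x,
                    (height + threads.y - 1) / threads.y);
        addOne<<<blocks, threads>>>(devMat, width, height);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // One device-to-host copy; cudaMemcpy waits for the kernel to finish first.
    ok = ok && cudaMemcpy(hostMat.data(), devMat, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(devMat);
    return ok;
}
```

Calling roundTripFlat on a std::vector<int> of size width * height should leave every element incremented by one. With a flat layout the device-side table of row pointers is no longer needed either, which removes two more copies per call.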

score 0 · Accepted Answer

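I had a similar experience and found that cudaMalloc was the bottleneck, while cudaMemcpy was not. On my device, I recall that allocating 16 MB took 160 ms. However, CUDA memory allocation can be done before the actual computation, for example in a separate function call. The allocation time can therefore be left out of overall performance measurements such as speedup, although I would include the cudaMemcpy operations in the speedup calculation.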
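A minimal sketch of that approach, under stated assumptions: the device buffer is allocated once before the loop, and the CUDA events bracket the copies and the kernel launches but not cudaMalloc. The names scaleByTwo and timedLoop are hypothetical, and the kernel is just a placeholder for the real work.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Placeholder kernel (hypothetical): doubles every element.
__global__ void scaleByTwo(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

// Runs `iterations` upload/kernel/download rounds on ONE device buffer that is
// allocated up front. Returns the elapsed milliseconds covering the copies and
// kernels (cudaMalloc is outside the timed region), or -1 on a CUDA error.
float timedLoop(std::vector<int> &host, int iterations)
{
    int n = (int)host.size();
    assert(n > 0 && iterations > 0);
    size_t bytes = sizeof(int) * host.size();

    // Allocate once, outside both the loop and the timed region.
    int *dev = nullptr;
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    for (int it = 0; it < iterations; ++it) {
        cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
        scaleByTwo<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = cudaGetLastError() == cudaSuccess;

    // Release everything once, after all iterations are done.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ok ? ms : -1.0f;
}
```

In the question's code, the equivalent change would be to move the cudaMalloc for device_Seam out of kernel_shiftSeam so it runs once rather than on each of the 800 calls, and to free it after the last call.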
