cuda - Same code, but the mex version is much slower than pure C. Why?

Question

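I have a CUDA program for Matlab, but the mex version is much slower than the Visual Studio version, even though the code is identical apart from a short mexFunction that handles the input/output arguments. The mex version takes 3 seconds, while the pure C version takes 0.5 seconds.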

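I am using a Quadro K2000M card with CUDA compute capability 3.0, CUDA driver 5.5, and runtime 5.0, programming in Visual Studio 2010. I followed the steps for MATLAB's mexGPUExample.cu and only changed the settings to -gencode=arch=compute_30,code=\"sm_30,compute_30\" (removing the flags for lower versions).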

In detail:

Pure C code (created as a Visual Studio 2010 project in Nsight 3.1, with code generation changed to compute_30,sm_30)

int main(int argc, char *argv[]){
    clock_t begin, end;
    double elapsed_time;

    // code that prepares the parameters from argc and argv

    begin = clock();
    a_function_that_calls_a_cuda_kernel(parameters);
    end = clock();
    elapsed_time = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("elapsed time: %f seconds\n", elapsed_time);

    return 0;
}

Matlab mex code (following MATLAB's mexGPUExample.cu, details at http://www.mathworks.se/help/distcomp/create-and-run-mex-files-containing-cuda-code.html, with the settings slightly modified to -gencode=arch=compute_30,code=\"sm_30,compute_30\")

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    clock_t begin, end;
    double elapsed_time;

    // code that prepares the parameters from prhs

    begin = clock();
    a_function_that_calls_a_cuda_kernel(parameters);
    end = clock();
    elapsed_time = (double)(end - begin) / CLOCKS_PER_SEC;
    mexPrintf("elapsed time: %f seconds\n", elapsed_time);
}

The mex version takes 3 seconds, while the pure C version takes 0.5 seconds. Why? Any hints are much appreciated.

score 1 · Accepted Answer

Your question is not entirely clear. I am assuming the following comparison:

You have CUDA code that runs faster when compiled as a standalone program under Visual Studio than when it is compiled as a mexFunction and called from Matlab.

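You should be aware that the first call to a mexFunction is "slow" because the CUDA context is set up, the kernel is processed by the driver, and the code is uploaded to the GPU.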

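So, to get a meaningful estimate of the execution time, you should first "warm up" the kernel by calling it once, and then time the subsequent calls. If the code is very fast, compute the time as an average over many calls.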
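The warm-up-then-average idea can be sketched as a small timing helper in plain C. This is only an illustration under stated assumptions: `work_fn`, `time_after_warmup`, and `count_call` are hypothetical names that do not appear in the question, and `count_call` is a stand-in for `a_function_that_calls_a_cuda_kernel(parameters)`. It also assumes the timed function blocks until the GPU work is finished (for example by calling `cudaDeviceSynchronize()` before it returns), since kernel launches are asynchronous and the timer would otherwise stop too early.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical signature for the function being measured; in the question
   this role is played by a_function_that_calls_a_cuda_kernel(parameters). */
typedef void (*work_fn)(void *params);

/* Stand-in workload (an assumption for illustration): it only counts how
   often it was called, so the helper can be tried without a GPU. */
void count_call(void *params)
{
    int *calls = (int *)params;
    ++*calls;
}

/* Calls fn once without timing it (the warm-up call that absorbs one-time
   costs such as CUDA context creation), then returns the mean time in
   seconds of `runs` further calls, measured with clock() as in the question. */
double time_after_warmup(work_fn fn, void *params, int runs)
{
    clock_t begin, end;
    int i;

    assert(runs > 0);

    fn(params);                      /* warm-up call, not timed */

    begin = clock();
    for (i = 0; i < runs; ++i) {
        fn(params);                  /* timed calls */
    }
    end = clock();

    return (double)(end - begin) / CLOCKS_PER_SEC / runs;
}
```

In both `main` and `mexFunction`, the single timed call would be replaced by something like `time_after_warmup(wrapper, &parameters, 10)`, where `wrapper` is a small adapter around the real function. Comparing the two builds on that figure, rather than on the first call, shows whether the 3 s versus 0.5 s gap is start-up cost or a real difference in kernel speed.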
