cuda - Thrust::sort takes a very long time to compile

Question

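I'm trying to compile a sample piece of code that uses Thrust, to help me learn some CUDA.

I'm using Visual Studio 2010 and have gotten other examples to compile. When I compile this example, however, the build takes more than 10 minutes. I selectively commented out lines and found that the thrust::sort line is the one that takes forever (with that one line commented out, the build takes about 5 seconds).

I found a post somewhere saying that sort is slow to compile in Thrust, and that this was a deliberate decision by the Thrust development team (it is 3x faster at run time, but takes longer to compile). But that post was from late 2008.

Any idea why this is taking so long?

Also, I'm compiling on a machine with the following specs, so it's not a slow machine: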

i7-2600k @ 4.5 GHz
16 GB DDR3 @ 1833 MHz
RAID 0 of 6 GB/s 1 TB drives

As requested, here is what looks like the build string Visual Studio is invoking:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -G0 --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp" -clean

Example

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());
    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}

score 1 · Accepted Answer

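The compiler in CUDA 3.2 was not optimized for compiling long, complex programs such as sort in debug mode (i.e. nvcc -G0). You should find CUDA 4.0 to be much faster in this case. Removing the -G0 option should also reduce compile time dramatically.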
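
To illustrate that last point, here is a sketch of the build string from the question with -G0 dropped and every other flag left as Visual Studio generated it. It assumes the flag can simply be removed through the project's CUDA build settings, which I have not checked against this particular project. The trailing -clean is also left off, since as far as I know that option makes nvcc delete the files the command would generate rather than build them.

```bat
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp"
```

Host-side debug information (-g and /Zi) is still produced by this command. Only device-side debug code generation is turned off, so the likely trade-off is losing the ability to step through device code in a debugger.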
