I have a kernel that achieves its best performance at different block sizes on Kepler and Fermi hardware. I would like, at compile time, to check the current architecture target and define a THREADS_PER_BLOCK macro, which I can then use to: i) launch the kernel; ii) determine the number of blocks necessary; iii) statically set the shared memory size in the kernel.
The code below demonstrates what I am attempting to do. Suppose I am targeting GK104 hardware, and hence compile with nvcc -arch=sm_30. This will still result in THREADS_PER_BLOCK = 256, since __CUDA_ARCH__ is not defined for the host-code compilation pass. (I understand, from e.g. this answer, why it can't work this way.)
#if __CUDA_ARCH__ >= 300
#define THREADS_PER_BLOCK 512
#else
#define THREADS_PER_BLOCK 256
#endif

__global__ void some_kernel(int* a, int* b) {
    __shared__ int sm_data[THREADS_PER_BLOCK];
    // Do something.
}

int main(void) {
    // Initialize data.
    // Calculate blocks based on THREADS_PER_BLOCK, problem size and some max.
    some_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b);
    return 0;
}
I could check device properties at run time and use dynamic shared memory, but I would like to know whether this can be hard-coded at compile time without, e.g., having to manually pass a -DFERMI or -DKEPLER flag and set THREADS_PER_BLOCK based on that. NB: any users of this code will be compiling it themselves, almost certainly for a single architecture, so this isn't an unreasonable option. It just seems superfluous in light of the -arch= flag already being passed.
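For completeness, the run-time alternative I'd like to avoid would look roughly like the sketch below. The helper name pick_block_size and the problem size are placeholders of my own, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void some_kernel(int* a, int* b) {
    // Dynamic shared memory: sized by the third launch-configuration argument.
    extern __shared__ int sm_data[];
    // Do something.
}

// Placeholder: choose the block size from the compute capability at run time.
static int pick_block_size(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return (prop.major >= 3) ? 512 : 256;  // Kepler (sm_3x) vs. Fermi
}

int main(void) {
    int *d_a = 0, *d_b = 0;
    // ... allocate and initialize d_a and d_b ...
    int threads = pick_block_size();
    int n = 1 << 20;                       // placeholder problem size
    int blocks = (n + threads - 1) / threads;
    some_kernel<<<blocks, threads, threads * sizeof(int)>>>(d_a, d_b);
    cudaDeviceSynchronize();
    return 0;
}
```

This works, but it moves the decision to run time and forces dynamic shared memory, which is exactly what I'm hoping to avoid by resolving everything from the -arch= flag at compile time.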