I have a kernel that achieves its best performance at different block sizes on Kepler and Fermi hardware. I would like, at compile time, to check the current architecture target and define a THREADS_PER_BLOCK macro, which I can then use to: i) launch the kernel; ii) determine the number of blocks necessary; iii) statically set the shared memory size in the kernel.
The code below demonstrates what I am attempting to do. Suppose I am targeting GK104 hardware, and hence compile with nvcc -arch=sm_30. This will still result in THREADS_PER_BLOCK = 256, since __CUDA_ARCH__ is not defined for the host-code compilation pass. (I understand, from e.g. this answer, why it can't work this way.)
#if __CUDA_ARCH__ >= 300
#define THREADS_PER_BLOCK 512
#else
#define THREADS_PER_BLOCK 256
#endif

__global__ void some_kernel(int* a, int* b) {
    __shared__ int sm_data[THREADS_PER_BLOCK];
    // Do something.
}

int main(void) {
    // Initialize data.
    // Calculate blocks based on THREADS_PER_BLOCK, problem size and some max.
    some_kernel<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b);
    return 0;
}
I could check device properties at run time and use dynamic shared memory, but I would like to know whether this can be hard-coded at compile time without, e.g., having to manually pass a -DFERMI or -DKEPLER flag and set THREADS_PER_BLOCK based on that. NB: any users of this code will be compiling it themselves, almost certainly for a single architecture, so this isn't an unreasonable option. It just seems superfluous in light of the -arch= flag already being passed.
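For completeness, the run-time alternative I'd like to avoid would look roughly like the sketch below. The helper name pick_block_size and the problem size are placeholders of my own, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

__global__ void some_kernel(int* a, int* b) {
    // Dynamic shared memory: sized by the third launch-configuration argument.
    extern __shared__ int sm_data[];
    // Do something.
}

// Placeholder: choose the block size from the compute capability at run time.
static int pick_block_size(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    return (prop.major >= 3) ? 512 : 256;  // Kepler (sm_3x) vs. Fermi
}

int main(void) {
    int *d_a = 0, *d_b = 0;
    // ... allocate and initialize d_a and d_b ...
    int threads = pick_block_size();
    int n = 1 << 20;                       // placeholder problem size
    int blocks = (n + threads - 1) / threads;
    some_kernel<<<blocks, threads, threads * sizeof(int)>>>(d_a, d_b);
    cudaDeviceSynchronize();
    return 0;
}
```

This works, but it moves the decision to run time and forces dynamic shared memory, which is exactly what I'm hoping to avoid by resolving everything from the -arch= flag at compile time.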