Context: I am currently learning how to properly use CUDA, in particular how to generate random numbers using CURAND. I learned here that it might be wise to generate my random numbers directly when I need them, inside the kernel which performs the core calculation in my code.

Following the documentation, I decided to play a bit and try to come up with a simple, running piece of code which I can later adapt to my needs.

I excluded MTGP32 because of the limit of 256 concurrent threads in a block (and just 200 pre-generated parameter sets). Besides, I do not want to use doubles, so I decided to stick to the default generator (XORWOW).

Problem: I am having a hard time understanding why the same seed value in my code generates different sequences of numbers once the number of threads per block exceeds 128 (when blockSize < 129, everything runs as I would expect). After adding proper CUDA error checking, as suggested by Robert in his comment, it is fairly clear that hardware limitations play a role. Moreover, compiling without the "-G -g" flags raises the "threshold for trouble" from 128 to 384.

Questions: What exactly is causing this? Robert wrote in his comment that "it might be a registers per thread issue". What does this mean? Is there an easy way to look at the hardware specs and tell where this limit will lie? Can I get around this issue without having to generate more random numbers per thread?

A related issue seems to have been discussed here but I do not think it applies to my case.

My code (see below) was mostly inspired by these examples.

Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <curand_kernel.h>

    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
        if (code != cudaSuccess){ 
           fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
           if (abort) exit(code);
        }
    }

    __global__ void setup_kernel(curandState *state, int seed, int n){

        int id = threadIdx.x + blockIdx.x*blockDim.x;

        if(id<n){
            curand_init(seed, id, 0, &state[id]); // same seed, distinct sequence number per thread
        }
    }

    __global__ void generate_uniform_kernel(curandState *state, float *result, int n){

        int id = threadIdx.x + blockIdx.x*blockDim.x;
        float x;

        if(id<n){
            curandState localState = state[id]; // work on a local copy of the state
            x = curand_uniform(&localState);
            state[id] = localState;
            result[id] = x; 
        }
    }

    int main(int argc, char *argv[]){

        curandState *devStates;
        float *devResults, *hostResults;

        int n = atoi(argv[1]);         // number of random values to generate
        int s = atoi(argv[2]);         // seed
        int blockSize = atoi(argv[3]); // threads per block

        int nBlocks = n/blockSize + (n%blockSize == 0?0:1); // round up to cover all n elements

        printf("\nn: %d, blockSize: %d, nBlocks: %d, seed: %d\n", n, blockSize, nBlocks, s);

        hostResults = (float *)calloc(n, sizeof(float));
        cudaMalloc((void **)&devResults, n*sizeof(float));

        cudaMalloc((void **)&devStates, n*sizeof(curandState));
        setup_kernel<<<nBlocks, blockSize>>>(devStates, s, n);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize() );

        generate_uniform_kernel<<<nBlocks, blockSize>>>(devStates, devResults, n);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize() );

        cudaMemcpy(hostResults, devResults, n*sizeof(float), cudaMemcpyDeviceToHost);

        for(int i=0; i<n; i++) {
            printf("\n%10.13f", hostResults[i]);
        }

        cudaFree(devStates);
        cudaFree(devResults);
        free(hostResults);

        return 0;
    }

I compiled two binaries, one using the "-G -g" debugging flags and the other without. I named them rng_gen_d and rng_gen, respectively:

     $ nvcc -lcuda -lcurand -O3 -G -g --ptxas-options=-v rng_gen.cu -o rng_gen_d
    ptxas /tmp/tmpxft_00002257_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
    ptxas info    : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
    ptxas info    : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
    ptxas info    : Used 43 registers, 32 bytes smem, 72 bytes cmem[1], 6480 bytes lmem
    ptxas info    : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
    ptxas info    : Used 10 registers, 36 bytes smem, 40 bytes cmem[1], 48 bytes lmem

     $ nvcc -lcuda -lcurand -O3 --ptxas-options=-v rng_gen.cu -o rng_gen
    ptxas /tmp/tmpxft_00002b73_00000000-5_rng_gen.ptx, line 533; warning : Double is not supported. Demoting to float
    ptxas info    : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
    ptxas info    : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
    ptxas info    : Used 20 registers, 32 bytes smem, 48 bytes cmem[1], 6440 bytes lmem
    ptxas info    : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
    ptxas info    : Used 19 registers, 36 bytes smem, 4 bytes cmem[1]

To start with, there is a strange warning message at compile time (see above):

    ptxas /tmp/tmpxft_00002b31_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float

Some debugging showed that the line causing this warning is:

    curandState localState = state[id];

There are no doubles declared in my code, so I do not know exactly how to fix this (or even whether it needs fixing).

Now, an example of the (actual) problem I am facing:

     $ ./rng_gen_d 5 314 127

    n: 5, blockSize: 127, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen_d 5 314 128

    n: 5, blockSize: 128, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen_d 5 314 129

    n: 5, blockSize: 129, nBlocks: 1, seed: 314
    GPUassert: too many resources requested for launch rng_gen.cu 54

Line 54 is the gpuErrchk() call right after setup_kernel().

With the other binary (no "-G -g" flags at compile time), the "threshold for trouble" is raised to 384:

     $ ./rng_gen 5 314 129

    n: 5, blockSize: 129, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen 5 314 384 

    n: 5, blockSize: 384, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen 5 314 385

    n: 5, blockSize: 385, nBlocks: 1, seed: 314
    GPUassert: too many resources requested for launch rng_gen.cu 54

Finally, in case this is somehow related to the hardware I am using for this preliminary testing (the project will later be run on a much more powerful machine), here are the specs of the card I am using:

    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "Quadro NVS 160M"
      CUDA Driver Version / Runtime Version          5.5 / 5.5
      CUDA Capability Major/Minor version number:    1.1
      Total amount of global memory:                 256 MBytes (268107776 bytes)
      ( 1) Multiprocessors, (  8) CUDA Cores/MP:     8 CUDA Cores
      GPU Clock rate:                                1450 MHz (1.45 GHz)
      Memory Clock rate:                             702 Mhz
      Memory Bus Width:                              64-bit
      Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
      Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 8192
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  768
      Maximum number of threads per block:           512
      Max dimension size of a thread block (x,y,z): (512, 512, 64)
      Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             256 bytes
      Concurrent copy and kernel execution:          No with 0 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      No
      Device PCI Bus ID / PCI location ID:           1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro NVS 160M
    Result = PASS

And this is it. Any guidance on this matter will be most welcome. Thanks!

EDIT:

1) Added proper cuda error checking, as suggested by Robert.

2) Deleted the cudaMemset line, which was useless anyway.

3) Compiled and ran the code without the "-G -g" flags.

4) Updated the output accordingly.

1 Answer

First of all, whenever you are having trouble with CUDA code, it is always advisable to do proper cuda error checking. It will remove a certain amount of head-scratching, will probably save you some time, and will certainly improve people's ability to help you on sites like this one.

Now you have discovered that you have a registers-per-thread issue. The compiler, while generating code, will use registers for various purposes. Each thread requires a complement of these registers to run its thread code. When you attempt to launch a kernel, one of the requirements that must be satisfied is that the number of registers required per thread, times the number of threads requested in the launch, must be less than the total number of registers available per block. Note that the registers required per thread may have to be rounded up to some allocation granularity increment. Note also that, since threads are launched in warps, the number of threads requested will normally be rounded up to the next higher multiple of 32 if it is not evenly divisible by 32. Note further that the maximum registers per block varies by compute capability, and this quantity can be inspected via the deviceQuery sample, as you have shown. Also, as you discovered, certain command line switches such as -G can affect how nvcc uses registers.
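As a rough sanity check (my back-of-the-envelope arithmetic here assumes the per-block register allocation and two-warp, i.e. 64-thread, allocation granularity that the CUDA occupancy calculator lists for cc1.x devices, so treat it as an estimate), the ptxas figures from your question line up with the thresholds you observed:

    with -G:    setup_kernel needs 43 registers/thread
                128 threads:           128 * 43 = 5504 <= 8192  -> launches
                129 threads (rounded
                up to 192 = 6 warps):  192 * 43 = 8256 >  8192  -> fails

    without -G: setup_kernel needs 20 registers/thread
                384 threads:           384 * 20 = 7680 <= 8192  -> launches
                385 threads (rounded
                up to 448 = 14 warps): 448 * 20 = 8960 >  8192  -> fails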

To get advance notice of these kinds of resource problems, you can compile your code with additional command line switches:

    nvcc -arch=sm_11 -Xptxas=-v -o mycode mycode.cu

The -Xptxas=-v switch will cause resource usage output to be generated by the ptxas assembler (which converts the intermediate ptx code into sass assembly code, i.e. machine code), including the registers required per thread. Note that the output in this case is delivered per kernel, since each kernel may have its own requirements. You can find more information about the nvcc compiler in the documentation.
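If you would rather discover these numbers at run time than read the ptxas output, the runtime API exposes them per kernel through cudaFuncGetAttributes. A minimal sketch (printKernelLimits is a hypothetical helper of mine, and it assumes the setup_kernel and gpuErrchk definitions from your code are in scope):

    // Hypothetical helper, not part of the question's code.
    void printKernelLimits(){
        cudaFuncAttributes attr;
        gpuErrchk( cudaFuncGetAttributes(&attr, setup_kernel) );
        // numRegs: registers per thread; maxThreadsPerBlock: the largest
        // block size with which this kernel can currently be launched
        printf("setup_kernel: %d regs/thread, max block size %d\n",
               attr.numRegs, attr.maxThreadsPerBlock);
    }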

As a crude workaround, you can specify a switch at compile time that limits all kernel compilation to a maximum register usage number:

    nvcc -arch=sm_11 -Xptxas=-v -maxrregcount=16 -o mycode mycode.cu

This would limit each kernel to using no more than 16 registers per thread. Multiplied by 512 (the hardware limit on threads per block for a cc1.x device), that yields 8192, which is the hardware limit on total registers per thread block for your device.
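Applied to the build line from your question, that would look something like:

    nvcc -lcuda -lcurand -O3 --ptxas-options=-v -maxrregcount=16 rng_gen.cu -o rng_gen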

However, the above method is crude in that it applies the same limit to all kernels in your program. If you wanted to tailor this to each kernel launch (for example, if different kernels in your program are launched with different numbers of threads), you could use the launch bounds methodology described here.
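For instance, a minimal sketch of how the setup_kernel from your question could be annotated (the 512 is my choice here, matching the per-block thread limit of your cc1.1 device; note that registers the compiler is no longer allowed to use get spilled to local memory, which may cost performance):

    // __launch_bounds__(N) asks the compiler to cap register usage so that a
    // launch with up to N threads per block cannot fail for lack of registers
    __global__ void __launch_bounds__(512)
    setup_kernel(curandState *state, int seed, int n){
        int id = threadIdx.x + blockIdx.x*blockDim.x;
        if(id < n){
            curand_init(seed, id, 0, &state[id]);
        }
    }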

Answered on 2013-11-02T01:19:46.763