random - cuRAND performs much worse than thrust when generating random numbers inside CUDA kernels

Question

I am trying to generate "random" numbers from a uniform distribution inside a CUDA __global__ kernel using two different approaches. The first is using the cuRAND device API, and the second is using thrust. For each approach I have created a different class.

Here is my cuRAND solution:

template<typename T>
struct RNG1
{
    __device__
    RNG1(unsigned int tid) {
        curand_init(tid, tid, 0, &state);
    }

    __device__ T
    operator ()(void) {
        return curand_uniform(&state);
    }

    curandState state;
};

And here is my thrust solution:

template<typename T>
struct RNG2
{
    __device__
    RNG2(unsigned int tid)
        : gen(tid)
        , dis(0, 1) { gen.discard(tid); }

    __device__ T
    operator ()(void) {
        return dis(gen);
    }

    thrust::default_random_engine gen;
    thrust::uniform_real_distribution<T> dis;
};

The way I use them is the following:

template<typename T> __global__ void
mykernel(/* args here */)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Use one of the two generators (both take the global thread id):
    RNG1<T> rng(tid);
    // or
    // RNG2<T> rng(tid);

    T a_random_number = rng();  

    // do stuff here
}

Both of them work but the cuRAND solution is much slower (more than 3 times slower). If I set the second parameter of curand_init (sequence number) to 0, then the performance is the same as that of the thrust solution, but the random numbers are "bad". I can see patterns and artefacts in the resulting distribution.

Here are my two questions:

1. Can someone explain to me why the cuRAND solution with a non-zero sequence is slower?
2. How can thrust be as fast as cuRAND with zero sequence, but also generate good random numbers?

While searching on Google, I noticed that most people use cuRAND, and very few use thrust to generate random numbers inside device code. Is there something I should be aware of? Am I misusing thrust?

Thank you.

score 3 · Accepted Answer

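The performance difference may arise because cuRAND and Thrust use different PRNG algorithms, with different performance profiles and memory requirements. Note that cuRAND supports five different PRNG algorithms, and your code does not say which one is in use.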

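Thrust's default_random_engine is currently minstd_rand, but its documentation notes that this "may change in a future version". (A comment posted after I wrote mine also pointed out that it is minstd_rand.) minstd_rand is a simple linear congruential generator, which may be faster than whatever PRNG cuRAND is using.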
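To illustrate how little work a linear congruential generator does per number, here is a minimal host-side sketch of one LCG step. It assumes the parameters of the standard library's std::minstd_rand (multiplier 48271, modulus 2^31 - 1); whether Thrust's minstd_rand uses exactly the same constants is an assumption here and should be checked against the Thrust headers. The names lcg_step and matches_std_minstd_rand are hypothetical helpers, not part of any library.

```cpp
#include <cstdint>
#include <random>

// One step of a multiplicative linear congruential generator using the
// std::minstd_rand parameters: x_next = (48271 * x) mod (2^31 - 1).
// The entire generator state is a single 32-bit integer.
std::uint32_t lcg_step(std::uint32_t x) {
    const std::uint64_t a = 48271u;       // multiplier
    const std::uint64_t m = 2147483647u;  // modulus, 2^31 - 1
    return static_cast<std::uint32_t>((a * x) % m);  // 64-bit product avoids overflow
}

// Hypothetical helper: run the hand-written LCG and std::minstd_rand side by
// side from the same seed and report whether they agree for n steps.
// The seed is assumed to be non-zero and smaller than the modulus.
bool matches_std_minstd_rand(std::uint32_t seed, int n) {
    std::minstd_rand reference(seed);
    std::uint32_t x = seed;
    for (int i = 0; i < n; ++i) {
        x = lcg_step(x);
        if (x != reference()) return false;
    }
    return true;
}
```

Each draw costs one multiply and one modulo, and the state fits in a single register, which is consistent with the idea that such a generator can be cheap to set up and advance. The flip side is that an LCG this small has a short period and weaker statistical quality than larger generators, so it may not be suitable for every workload.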

This was a comment that was converted to an answer and edited.
