
So I'm trying to make use of this custom RNG library for openCL: http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html

The library defines a state struct:

//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;

And in order to generate a random uint, you pass the state into the following function:

uint MWC64X_NextUint(mwc64x_state_t *s)

which updates the state, so that when you pass it into the function again, the next "random" number in the sequence will be generated.

For the project I am creating, I need to generate random numbers not just across different work groups/items but also across multiple devices simultaneously, and I'm having trouble figuring out the best way to design this. Should I create one mwc64x_state_t object per device/command queue and pass that state in as a global variable? Is it possible to create one state object for all devices at once? Or should I not pass it in as a global variable at all, and instead declare a new state locally within each kernel function?

The library also comes with this function:

void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset)

which is supposed to split the RNG into multiple "streams", but including this in my kernel makes it incredibly slow. For instance, if I do something very simple like the following:

__kernel void myKernel()
{
    mwc64x_state_t rng;
    MWC64X_SeedStreams(&rng, 0, 10000);
}

Then the kernel call becomes around 40x slower.

The library does come with some source code that serves as example usages but the example code is kind of limited and doesn't seem to be that helpful.

So if anyone is familiar with RNGs in openCL or if you've used this particular library before I'd very much appreciate your advice.

1 Answer

The MWC64X_SeedStreams function is indeed relatively slow, at least in comparison to a MWC64X_NextUint call, but this is true of most parallel RNGs that try to split a large global stream into many sub-streams that can be used in parallel. The assumption is that you will be calling NextUint many times within the kernel (e.g. a hundred or more times), while SeedStreams appears only once at the top.

Here is an annotated version of the EstimatePi example that ships with the library (mwc64x/test/estimate_pi.cpp and mwc64x/test/test_mwc64x.cl):

__kernel void EstimatePi(ulong n, ulong baseOffset, __global ulong *acc)
{
    // One RNG state per work-item
    mwc64x_state_t rng;

    // This calculates the number of samples that each work-item uses
    ulong samplesPerStream=n/get_global_size(0);

    // Then skip each work-item ahead to its part of the stream, which
    // runs from stream offset:
    //   baseOffset+2*samplesPerStream*get_global_id(0)
    // up to (but not including):
    //   baseOffset+2*samplesPerStream*(get_global_id(0)+1)
    //
    MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);


    // Now use the numbers
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong x=MWC64X_NextUint(&rng);
        ulong y=MWC64X_NextUint(&rng);
        ulong x2=x*x;
        ulong y2=y*y;
        // Treating x and y as 32-bit fixed-point fractions in [0,1),
        // x2+y2 wraps around 2^64 exactly when x^2+y^2 >= 1, so no
        // wrap-around (x2+y2 >= x2) means the point is inside the
        // unit circle
        if(x2+y2 >= x2)
            count++;
    }
    acc[get_global_id(0)] = count;
}

So the intent is that n should be large, and should grow with the number of work-items, so that samplesPerStream stays at around a hundred or more.

If you want multiple kernels on multiple devices, then you need to add another level of hierarchy to the stream splitting. For example, if you have:

  • K : number of devices (possibly on parallel machines)
  • W : number of work-items per device
  • C : number of calls to NextUint per work-item

then you end up with a total of N=K*W*C calls to NextUint across all work-items. If your devices are identified as k=0..(K-1), then within each kernel you would do:

MWC64X_SeedStreams(&rng, W*C*k, C);

Then the indices within the stream would be:

[0             .. N )                 : Parts of stream used across all devices
[k*(W*C)       .. (k+1)*(W*C) )       : Used within device k
[k*(W*C)+(i*C) .. k*(W*C)+(i+1)*C )   : Used by work-item i in device k

It is fine if each work-item uses fewer than C samples; you can over-estimate C if necessary.

(I'm the author of the library.)

Answered 2014-10-14T13:05:58.343