random - Making CURAND generate different random numbers from a uniform distribution

Question

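I am trying to use the CURAND library to generate random numbers from 0 to 100 that are completely independent of each other. To do this, I give the time as the seed to every thread and specify "id = threadIdx.x + blockDim.x * blockIdx.x" as both the sequence and the offset. Then, after getting the random number as a float, I multiply it by 100 and take its integer value.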

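The problem I am facing is that threads [0,0] and [0,1] get the same random number, 11, no matter how many times I run the code. I cannot work out what I am doing wrong. Please help.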

My code is pasted below:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include<curand_kernel.h>
#include "util/cuPrintf.cu"
#include<time.h>

#define NE WA*HA //Total number of random numbers 
#define WA 2   // Matrix A width
#define HA 2   // Matrix A height
#define SAMPLE 100 //Sample number
#define BLOCK_SIZE 2 //Block size

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
int id = threadIdx.x  + blockIdx.x + blockDim.x;
curand_init ( seed, id , id, &state[id] );
}

__global__ void generate( curandState* globalState, float* randomMatrix )
{
int ind = threadIdx.x + blockIdx.x * blockDim.x;
if(ind < NE){
    curandState localState = globalState[ind];
    float stopId = curand_uniform(&localState) * SAMPLE;
    cuPrintf("Float random value is : %f",stopId);
    int stop = stopId ;
    cuPrintf("Random number %d\n",stop);
    for(int i = 0; i < SAMPLE; i++){
            if(i == stop){
                    float random = curand_normal( &localState );
                    cuPrintf("Random Value %f\t",random);
                    randomMatrix[ind] = random;
                    break;
            }
    }
    globalState[ind] = localState;
}
}

/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////

int main(int argc, char** argv)
{

// 1. allocate host memory for matrix A
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float* ) malloc(mem_size_A);
time_t t;

// 2. allocate device memory
float* d_A;
cudaMalloc((void**) &d_A, mem_size_A);

// 3. create random states    
curandState* devStates;
cudaMalloc ( &devStates, size_A*sizeof( curandState ) );

// 4. setup seeds
int n_blocks = size_A/BLOCK_SIZE;
time(&t);
printf("\nTime is : %u\n",(unsigned long) t);
setup_kernel <<< n_blocks, BLOCK_SIZE >>> ( devStates, (unsigned long) t );
// 4. generate random numbers
cudaPrintfInit();
generate <<< n_blocks, BLOCK_SIZE >>> ( devStates,d_A );
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
// 5. copy result from device to host
cudaMemcpy(h_A, d_A, mem_size_A, cudaMemcpyDeviceToHost);


// 6. print out the results
printf("\n\nMatrix A (Results)\n");
for(int i = 0; i < size_A; i++)
{
   printf("%f ", h_A[i]);
   if(((i + 1) % WA) == 0)
      printf("\n");
}
printf("\n");

// 7. clean up memory
free(h_A);
cudaFree(d_A);

}

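The output I get is:

Time is : 1347857063
[0, 0]: Float random value is : 11.675105[0, 0]: Random number 11
[0, 0]: Random Value 0.358356   [0, 1]: Float random value is : 11.675105[0, 1]: Random number 11
[0, 1]: Random Value 0.358356   [1, 0]: Float random value is : 63.840496[1, 0]: Random number 63
[1, 0]: Random Value 0.696459   [1, 1]: Float random value is : 44.712799[1, 1]: Random number 44
[1, 1]: Random Value 0.735049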

score 4 · Accepted Answer

There are a few issues here. I'll address the first ones to help you get started:

General points

Please check the return values of all CUDA API calls.
Please run cuda-memcheck to catch obvious problems such as out-of-bounds accesses.
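
As a rough illustration of the first point, one common pattern is to wrap every runtime call in a small checking macro. The names CUDA_CHECK, allocDeviceFloats and checkLastKernel below are hypothetical helpers made up for this sketch; they are not part of CUDA or of the code in the question. (In recent CUDA toolkits, compute-sanitizer takes over the role of cuda-memcheck.)

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption of this sketch, not a CUDA API):
// evaluates a runtime call and aborts with a readable message if it failed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: allocates `count` floats on the device and
// checks the result of cudaMalloc instead of ignoring it.
float* allocDeviceFloats(size_t count)
{
    assert(count > 0);
    float* d_ptr = NULL;
    CUDA_CHECK(cudaMalloc((void**)&d_ptr, count * sizeof(float)));
    return d_ptr;
}

// Hypothetical helper: call right after a kernel launch.
void checkLastKernel()
{
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
}
```

In the question's main(), this would mean wrapping each cudaMalloc and cudaMemcpy in CUDA_CHECK(...) and calling checkLastKernel() after each kernel launch.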

Specific points

When allocating space for the RNG state, you should have space for one state per thread (not one per matrix element as you have now).
Your thread ID calculation in setup_kernel() is wrong. It should be threadIdx.x + blockIdx.x * blockDim.x (* instead of +). With the + version, two different threads can compute the same id and end up sharing one state.
You use the thread ID as the sequence number as well as the offset. You should just set the offset to zero, as described in the cuRAND manual:

For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number.
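
Putting the specific points together, a minimal sketch might look like the following. It assumes one state per working thread, a shared seed, the thread ID as the sequence number and an offset of zero. The names setupKernel, drawKernel and drawInts are made up for this sketch, and the block size of 128 is an arbitrary choice.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Initialise one RNG state per thread: same seed, unique sequence, offset 0.
__global__ void setupKernel(curandState* states, unsigned long long seed, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;  // note '*', not '+'
    if (id < n) {
        curand_init(seed, id, 0, &states[id]);
    }
}

// Each thread draws one uniform float and scales it to an integer in 0..100.
// curand_uniform returns a value in (0, 1], so 100 itself can occur.
__global__ void drawKernel(curandState* states, int* out, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n) {
        curandState local = states[id];  // work on a local copy
        out[id] = (int)(curand_uniform(&local) * 100.0f);
        states[id] = local;              // save the advanced state
    }
}

// Hypothetical host helper: fills h_out[0..n-1]; returns true on success.
bool drawInts(unsigned long long seed, int n, int* h_out)
{
    assert(n > 0);
    curandState* d_states = NULL;
    int* d_out = NULL;
    if (cudaMalloc((void**)&d_states, n * sizeof(curandState)) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc((void**)&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_states);
        return false;
    }

    const int threads = 128;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up
    setupKernel<<<blocks, threads>>>(d_states, seed, n);
    drawKernel<<<blocks, threads>>>(d_states, d_out, n);

    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernels to finish before copying.
        err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_out);
    cudaFree(d_states);
    return err == cudaSuccess;
}
```

Calling drawInts((unsigned long long) time(NULL), 4, h_values) from main() would then give each of the four threads its own sequence. Two threads can still produce the same integer by chance, but they no longer share a state.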

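Finally, you are running two threads per block, which is very inefficient. See the "Maximize Utilization" section of the CUDA C Programming Guide for more information, but you should aim to launch a multiple of 32 threads per block (e.g. 128, 256) and a large number of blocks (e.g. tens of thousands). If your problem is small, consider running several problems at once, either batched in a single kernel launch or as kernels in different streams to get concurrent execution.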
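
To make the launch configuration advice concrete, here is a small sketch of how the grid size could be derived from the problem size. The names blocksFor, fillIndices and launchFill are hypothetical, and 256 threads per block is just one reasonable multiple of 32.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements,
// rounding up so that a partial last block is still launched.
int blocksFor(int n, int threadsPerBlock)
{
    assert(n > 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Example kernel: the bounds check matters because the rounded-up grid
// usually contains more threads than there are elements.
__global__ void fillIndices(int* out, int n)
{
    int ind = threadIdx.x + blockIdx.x * blockDim.x;
    if (ind < n) {
        out[ind] = ind;
    }
}

// Hypothetical launcher using 256 threads per block (a multiple of 32).
cudaError_t launchFill(int* d_out, int n)
{
    const int threadsPerBlock = 256;
    fillIndices<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_out, n);
    return cudaGetLastError();
}
```

For the 2x2 matrix in the question this still launches only one block, so the real gain comes from a larger problem or from batching several small problems into one launch.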
