c++ - CUDA kernel error when increasing the number of threads

Question

I am developing a CUDA ray-plane intersection kernel.

Suppose my plane (face) structure is:

typedef struct _Face {
    int ID;
    int matID;

    int V1ID;
    int V2ID;
    int V3ID;

    float V1[3];
    float V2[3];
    float V3[3];

    float reflect[3];

    float emmision[3];
    float in[3];
    float out[3];

    int intersects[RAYS];

} Face;

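I pasted the whole structure so you can get an idea of its size. In the current configuration, RAYS equals 625. In the code below, assume the faces array has 1270 elements (in general, thousands).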
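To put a number on that size, here is a minimal host-side C++ sketch. It repeats the structure with RAYS fixed at 625 and assumes 4-byte int and float with no padding, which holds on common platforms but is an assumption here. The helpers faceBytes and arrayBytes are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

#define RAYS 625  // value stated in the question

typedef struct _Face {
    int ID;
    int matID;

    int V1ID;
    int V2ID;
    int V3ID;

    float V1[3];
    float V2[3];
    float V3[3];

    float reflect[3];

    float emmision[3];
    float in[3];
    float out[3];

    int intersects[RAYS];

} Face;

// Size of a single Face in bytes (hypothetical helper).
std::size_t faceBytes() { return sizeof(Face); }

// Total bytes occupied by an array of `count` faces (hypothetical helper).
std::size_t arrayBytes(std::size_t count) {
    assert(count > 0);
    return count * sizeof(Face);
}
```

Under those assumptions one Face occupies 2604 bytes, 2500 of which belong to intersects, so 1270 faces take roughly 3.3 MB, and a statement such as `Face f = faces[offset];` nominally copies the full structure for every thread.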

Until today I launched my kernel in a very naive way:

const int tpb = 64; //threads per block
dim3 grid = (n +tpb-1)/tpb; // n - face count in array
dim3 block = tpb;
//.. some memory allocation etc.
theKernel<<<grid,block>>>(dev_ptr, n);

Inside the kernel I have a loop:

__global__ void theKernel(Face* faces, int faceCount) {
    int offset = threadIdx.x + blockIdx.x*blockDim.x;
    if(offset >= faceCount)
        return;
    Face f = faces[offset];
    //..some initialization
    int RAY = -1;
    for(float alpha=0.0f; alpha<=PI; alpha+= alpha_step ){ 
        for(float beta=0.0f; beta<=PI; beta+= beta_step ){ 
            RAY++;
            //..calculation per ray in (alpha,beta) direction ...
            faces[offset].intersects[RAY] = ...; //some assignment

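That is about it. I loop through all the directions and update the faces array. It works correctly, but it is barely faster than the CPU code.

So today I tried to optimize the code and launch the kernel with many more threads. Instead of 1 thread per face, I want 1 thread per ray of each face (meaning 625 threads work on 1 face). The modification was simple: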

dim3 grid = (n*RAYS +tpb-1)/tpb;  //before launching . RAYS = 625, n = face count

and the kernel itself:

__global__ void theKernel(Face *faces, int faceCount){

int threadNum = threadIdx.x + blockIdx.x*blockDim.x;

int offset = threadNum/RAYS; //RAYS is a global #define
int rayNum = threadNum - offset*RAYS;

if(offset >= faceCount || rayNum != 0)
    return;

Face f = faces[offset];
//initialization and the rest.. again ..

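And this code does not work at all. Why? In theory, only the first of the 625 threads per face should do any work, so why does this result in bad (almost no) computation?

Kind regards, e.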

score 0 · Accepted Answer

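As Heatsink pointed out, you are probably exceeding the available resources. It is a good idea to check for errors after the kernel executes.

Here is the C++ code I use: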

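#include <cutil_inline.h>
#include <iostream>  // std::cerr

// Print a message if a CUDA runtime call or kernel launch reported an error.
void
check_error(const char* str, cudaError_t err_code) {
    if (err_code != ::cudaSuccess)
        std::cerr << str << " -- " << cudaGetErrorString(err_code) << "\n";
}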

Then, when I call the kernel:

my_kernel <<<block_grid, thread_grid >>>(args);
check_error("my_kernel", cudaGetLastError());

score 0 · Accepted Answer

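The maximum size of a grid in any dimension is 65535 (CUDA Programming Guide, Appendix F). If your grid size was 1000 before the change, you have increased it to 625000. That exceeds the limit, so the kernel will not run correctly.

If you define the grid size as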
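The arithmetic behind those numbers can be checked on the host. The following is a minimal sketch, assuming the 65535 per-dimension limit quoted above; blocksNeeded and fitsInOneDimension are hypothetical helper names, not part of the CUDA API.

```cpp
#include <cassert>

// Per-dimension grid limit as quoted in the answer (assumption of this sketch).
const long long kMaxGridDim = 65535;

// Round-up division: the smallest block count with blocks * tpb >= threads.
long long blocksNeeded(long long threads, int tpb) {
    assert(threads >= 0 && tpb > 0);
    return (threads + tpb - 1) / tpb;
}

// True if a 1D grid with that many blocks stays within the quoted limit.
bool fitsInOneDimension(long long blocks) {
    return blocks <= kMaxGridDim;
}
```

With tpb = 64 and RAYS = 625, an array of 64000 faces needs 1000 blocks at one thread per face and 625000 blocks at one thread per ray, which fails the check. For the 1270 faces mentioned in the question the same formula gives 20 and 12403 blocks, so whether the limit is actually hit depends on the face count.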

dim3 grid((n + tpb - 1) / tpb, RAYS);

then all grid dimensions will be below the limit. You will also have to change how blockIdx is used inside the kernel.
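One possible way to use blockIdx with such a 2D grid is to take the face index from the x dimension and the ray index from blockIdx.y. The host-side C++ sketch below emulates that mapping so it can be checked without a GPU. FaceRay, mapIndex and coversEachPairOnce are hypothetical names, and the mapping itself is an assumption about how the kernel might be adapted, not the answerer's exact code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FaceRay {
    int face;  // index into the faces array
    int ray;   // index into Face::intersects
};

// Mirrors what the kernel would compute from the built-in variables:
//   face = threadIdx.x + blockIdx.x * blockDim.x;   ray = blockIdx.y;
FaceRay mapIndex(int blockIdxX, int blockIdxY, int threadIdxX, int blockDimX) {
    FaceRay r;
    r.face = threadIdxX + blockIdxX * blockDimX;
    r.ray = blockIdxY;
    return r;
}

// Walks the whole emulated grid and reports whether every (face, ray)
// pair is visited exactly once.
bool coversEachPairOnce(int faceCount, int rays, int tpb) {
    assert(faceCount > 0 && rays > 0 && tpb > 0);
    const int gridX = (faceCount + tpb - 1) / tpb;  // blocks along x
    const int gridY = rays;                         // one row of blocks per ray
    std::vector<int> hits(static_cast<std::size_t>(faceCount) * rays, 0);
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int tx = 0; tx < tpb; ++tx) {
                FaceRay fr = mapIndex(bx, by, tx, tpb);
                if (fr.face >= faceCount) continue;  // same bounds check as in the kernel
                ++hits[static_cast<std::size_t>(fr.face) * rays + fr.ray];
            }
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

With this layout the grid is 20 x 625 blocks for 1270 faces and tpb = 64, and the only threads that return early are the padding threads in the last block of each row.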
