c++ - Fill an array or a list in CUDA kernel but not in every thread

Question

Basically, I have an if() in my kernel, and if the condition is met I would like to store a new value in a dynamic list or array. The problem is that I can't use threadIdx as the index, because not every thread will write a value.

Something like :

__global__ void myKernel(customType *c)
{
    int i = threadIdx.x;
    //whatever
    if(condition)
        c->pop(newvalue); // pseudocode: store newvalue in the list
}

In fact, I would like to avoid c[i] = newvalue, because afterwards I would have to loop over every c[i] in the host code to check whether a value was inserted, and then fill another structure properly. I thought about Thrust, but it seems like overkill for my "simple" problem.

Hope you can help me find a workaround.

score 5 · Accepted Answer

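If I have understood your question correctly, you have two choices.

The first is to pre-allocate an output location for every thread and have only some of the threads write their output. This leaves you with an output that has gaps in it. You can remove the gaps with stream compaction, which is a solved problem in CUDA: a quick Google search will turn up many options, and both Thrust and CUDPP have compaction functions you can use.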
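A minimal sketch of this first option, assuming int data and a "keep positive values" test standing in for the question's `condition`. The names `kEmpty`, `IsFilled`, `markKernel`, and `compactPositive` are hypothetical, and the sentinel only works if it can never occur as a real value. Error checking is omitted for brevity.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical sentinel meaning "this thread wrote nothing".
constexpr int kEmpty = -1;

// Predicate for thrust::copy_if: keep every slot that was actually filled.
struct IsFilled {
    __host__ __device__ bool operator()(int v) const { return v != kEmpty; }
};

// Pass 1: thread i owns slot i; threads with nothing to emit store the sentinel.
__global__ void markKernel(const int *in, int *slots, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0)            // stand-in for the question's `condition`
        slots[i] = in[i];     // stand-in for `newvalue`
    else
        slots[i] = kEmpty;
}

// Pass 2: compact the slots so the filled ones become contiguous.
std::vector<int> compactPositive(const std::vector<int> &hostIn)
{
    int n = static_cast<int>(hostIn.size());
    if (n == 0) return {};

    thrust::device_vector<int> in(hostIn.begin(), hostIn.end());
    thrust::device_vector<int> slots(n), out(n);

    int threads = 256, blocks = (n + threads - 1) / threads;
    markKernel<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                    thrust::raw_pointer_cast(slots.data()), n);

    // copy_if returns an iterator one past the last element it wrote.
    auto end = thrust::copy_if(slots.begin(), slots.end(), out.begin(), IsFilled());

    std::vector<int> result(end - out.begin());
    thrust::copy(out.begin(), end, result.begin());
    return result;
}
```

Because `thrust::copy_if` is stable, the surviving values keep the order of the threads that produced them.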

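The second choice is to use a global-memory counter and have each thread atomically increment the counter when it claims a position in the output stream, so something like: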

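// Output cursor in device global memory; it must be declared __device__
// to be visible from the kernel. Set it to zero before each launch.
__device__ unsigned int opos;

__global__ void myKernel(customType *c)
{
    //whatever
    if(condition) {
        // atomicAdd returns the old value, so each writer gets a unique slot
        unsigned int pos = atomicAdd(&opos, 1u);
        c[pos] = newval;
    }
}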
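The host side of the counter approach has to zero the counter before each launch and read it back afterwards to learn how many values were written. A minimal sketch, assuming int data and a "keep positive values" test in place of `condition`; `outCount`, `appendKernel`, and `appendPositive` are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Device-side output cursor shared by all threads.
__device__ unsigned int outCount;

// Each thread that passes the test reserves the next free slot with atomicAdd.
__global__ void appendKernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {                         // stand-in condition
        unsigned int pos = atomicAdd(&outCount, 1u);  // returns the old value
        out[pos] = in[i];
    }
}

// Host wrapper: reset the counter, launch, then read back count and values.
std::vector<int> appendPositive(const std::vector<int> &hostIn)
{
    int n = static_cast<int>(hostIn.size());
    if (n == 0) return {};

    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));   // worst case: every thread emits
    cudaMemcpy(dIn, hostIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned int zero = 0;
    cudaMemcpyToSymbol(outCount, &zero, sizeof(zero));

    int threads = 256, blocks = (n + threads - 1) / threads;
    appendKernel<<<blocks, threads>>>(dIn, dOut, n);

    // This copy waits for the kernel on the default stream to finish.
    unsigned int count = 0;
    cudaMemcpyFromSymbol(&count, outCount, sizeof(count));

    std::vector<int> result(count);
    cudaMemcpy(result.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Note that the order of the values in the output depends on which thread reaches `atomicAdd` first, so it can differ from run to run; sort afterwards if order matters.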

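If you have a Kepler card and expect only a small number of threads to emit output, the second option may be faster. Otherwise, stream compaction is probably the better choice.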

score 4 · Accepted Answer

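If I understand correctly, what you are describing is stream compaction. Some, but not all, threads produce a value, and you want to store those values in an array without any gaps.

One way to do this is to use the stream compaction algorithms available in Thrust (see this example). Note that this does require you to perform the operation in two passes.

If you are doing this within a single thread block (rather than across the whole grid), you can also look at CUB. Each thread computes a flag indicating whether it wants to store a value, a prefix sum over the flags determines each thread's offset into the list, and then each thread performs its store.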
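The flag, prefix-sum, and store steps for a single block could look roughly like the following sketch built on `cub::BlockScan`. It assumes int data, a "keep positive values" test, and an input no longer than one block; `kBlockThreads`, `blockCompactKernel`, and `blockCompactPositive` are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cub/block/block_scan.cuh>
#include <cuda_runtime.h>

constexpr int kBlockThreads = 128;  // hypothetical block size

// Single-block compaction: compute a flag, scan the flags, scatter the values.
__global__ void blockCompactKernel(const int *in, int *out, int *outCount, int n)
{
    using BlockScan = cub::BlockScan<int, kBlockThreads>;
    __shared__ typename BlockScan::TempStorage temp;

    int i = threadIdx.x;
    int value = (i < n) ? in[i] : 0;
    int flag = (i < n && value > 0) ? 1 : 0;   // stand-in condition

    // offset = number of flagged threads with a lower index;
    // total  = number of flagged threads in the whole block.
    int offset = 0, total = 0;
    BlockScan(temp).ExclusiveSum(flag, offset, total);

    if (flag) out[offset] = value;
    if (threadIdx.x == 0) *outCount = total;
}

// Host wrapper for inputs that fit in one block (n <= kBlockThreads).
std::vector<int> blockCompactPositive(const std::vector<int> &hostIn)
{
    int n = static_cast<int>(hostIn.size());
    if (n == 0 || n > kBlockThreads) return {};

    int *dIn = nullptr, *dOut = nullptr, *dCount = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMalloc(&dCount, sizeof(int));
    cudaMemcpy(dIn, hostIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockCompactKernel<<<1, kBlockThreads>>>(dIn, dOut, dCount, n);

    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> result(count);
    cudaMemcpy(result.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dCount);
    return result;
}
```

Unlike the atomic counter, the scan assigns offsets by thread index, so the output order is deterministic. Extending this to a whole grid needs an extra step to combine per-block totals, which this sketch does not attempt.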
