arrays - CUDA array element shifting operation

Question

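I'm currently implementing an array shift operation in CUDA, but I'm stuck on the part where I need to parallelize the operation on the GPU (I've already done it for the CPU). The operation basically shifts elements within an array.

For example, if I have an M × N matrix, then for each row, if I see a -1, I replace the -1 with the element next to it, and so on until I reach the end of the row, and I need to do this for all the columns in parallel.

Here is a simple example: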

 3  4  1 -1  5  6  7  8
-1  4  5  2  1  2  5  2
 2  4  5  1  2  3  4 -1

For that matrix, the resulting matrix would be:

 3  4  1  5  6  7  8  8
 4  5  2  1  2  5  2  2
 2  4  5  1  2  3  4 -1

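P.S. The last element stays the same, because it has reached the boundary and there is nothing to replace it with. Also, each row will contain only one -1.

So that is basically the operation, but my question is: how do I assign a thread to each row, or otherwise parallelize over all the rows and do the shifting simultaneously in CUDA? Also, my array is converted from a 2D array to a 1D array using the equation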

array1d[i+width*j]  =  array2d[i][j];

So far, I have tried:

__global__ void gpu_shiftArray(int *Arr, int *location, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    int index = i+width*j;

    //shift when I see -1
    if(Arr[index] == -1)
    {
        Arr[index] = (index % height) ? Arr[index+1] : 
    }
    //location stores the index of -1, so anything after the -1 will be shifted too
    if((location[i]+width*j) <= index)
    {
        Arr[index] = (index % height) ? Arr[index+1] : 
    }
}

Its output is not exactly right (off by 5-10 values), but I'm not sure why, and I don't know what I'm doing wrong.

score 1 · Accepted Answer

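This looks like it can be done with a slightly modified "stream compaction" algorithm, which uses a "predicate sum" as its primitive. See the following link for details: Parallel Prefix Sum (Scan) with CUDA.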
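
To make the predicate-sum idea concrete, here is a rough sketch (my own illustration, not code from the linked chapter). It assumes each row holds at most one -1, the matrix is stored row-major as in the question, and the row width is at most 1024 so that one block can cover a row with one thread per column. The names shiftRowsByScan and shiftRowsScanHost are made up for this example, and error checking is left out. The kernel writes to a separate output buffer, so no thread reads an element that another thread may already have overwritten.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One block per row, one thread per column (blockDim.x == width, width <= 1024).
// Hypothetical kernel name; launch with 2 * width ints of dynamic shared memory.
__global__ void shiftRowsByScan(const int* in, int* out, int width)
{
    extern __shared__ int scanBuffer[];      // two buffers of `width` ints each
    int* src = scanBuffer;
    int* dst = scanBuffer + width;

    int row = blockIdx.x;
    int col = threadIdx.x;
    int value = in[row * width + col];
    int keep = (value != -1);                // predicate: 1 if the element survives

    // Inclusive Hillis-Steele scan of the predicate (the "predicate sum").
    src[col] = keep;
    __syncthreads();
    for (int offset = 1; offset < width; offset <<= 1) {
        dst[col] = src[col] + (col >= offset ? src[col - offset] : 0);
        __syncthreads();
        int* tmp = src; src = dst; dst = tmp;
    }
    int keptBeforeOrAt = src[col];           // survivors in columns 0..col
    int keptInRow = src[width - 1];          // survivors in the whole row

    // Compaction: each survivor moves left by the number of -1s before it.
    if (keep)
        out[row * width + keptBeforeOrAt - 1] = value;

    // The "slight modification": slots freed at the end of the row are
    // filled with the row's original last element, as in the question.
    if (col >= keptInRow)
        out[row * width + col] = in[row * width + width - 1];
}

// Hypothetical host helper: copies the matrix to the GPU, runs the kernel,
// and returns the shifted matrix. No CUDA error checking, for brevity.
std::vector<int> shiftRowsScanHost(const std::vector<int>& host, int width, int height)
{
    assert(width >= 1 && width <= 1024);
    size_t bytes = host.size() * sizeof(int);
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    shiftRowsByScan<<<height, width, 2 * width * sizeof(int)>>>(dIn, dOut, width);

    std::vector<int> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling shiftRowsScanHost(matrix, 8, 3) on the 3 × 8 example from the question should reproduce the result matrix shown there. For rows wider than one block, the scan would have to be split across blocks, which this sketch does not attempt.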

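Hmm. I can see that there might also be an advantage in using the vote functions (comparing the source data against -1) and some bit arithmetic to determine how (and whether) the threads of a warp choose a destination offset when doing the copy.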
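
A minimal sketch of that warp-vote variant, under my own assumptions: one warp handles one row, the row width is at most 32, the matrix is row-major, and the toolkit is CUDA 9 or later (for __ballot_sync). The names shiftRowsByBallot and shiftRowsBallotHost are invented for this example, and error checking is omitted. The ballot marks which lanes hold -1, and a population count over the lower lanes gives each surviving element its destination offset.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One warp (32 threads) per row; lanes at or beyond `width` only take part in the vote.
// Hypothetical kernel name; launch with blockDim.x == 32 and width <= 32.
__global__ void shiftRowsByBallot(const int* in, int* out, int width)
{
    const unsigned fullMask = 0xffffffffu;
    int row = blockIdx.x;
    int lane = threadIdx.x;
    bool inRow = lane < width;
    int value = inRow ? in[row * width + lane] : 0;

    // Vote: bit k of `holes` is set when lane k holds a -1.
    unsigned holes = __ballot_sync(fullMask, inRow && value == -1);

    // Bit arithmetic: count the -1s in lanes strictly below this one.
    unsigned lowerLanes = (1u << lane) - 1u;
    int holesBefore = __popc(holes & lowerLanes);
    int keptInRow = width - __popc(holes);

    // "Whether": only elements that are not -1 get copied.
    // "How": each one moves left by the number of -1s before it.
    if (inRow && value != -1)
        out[row * width + lane - holesBefore] = value;

    // Slots freed at the end of the row take the row's original last element.
    if (inRow && lane >= keptInRow)
        out[row * width + lane] = in[row * width + width - 1];
}

// Hypothetical host helper: copies the matrix to the GPU, runs the kernel,
// and returns the shifted matrix. No CUDA error checking, for brevity.
std::vector<int> shiftRowsBallotHost(const std::vector<int>& host, int width, int height)
{
    assert(width >= 1 && width <= 32);
    size_t bytes = host.size() * sizeof(int);
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    shiftRowsByBallot<<<height, 32>>>(dIn, dOut, width);

    std::vector<int> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Because the vote and the population count happen in registers, this version needs no shared memory and no __syncthreads(), but it only covers rows that fit in a single warp. Wider rows would need the per-warp counts combined, for example with a scan across warps.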
