cuda - Simple linear transformation algorithm not working

Question

__global__ 
void transpose(double *input, double *output, int *width, int *height) 
{
    int threadidx = (blockIdx.x * blockDim.x) + threadIdx.x;
    int row = threadidx / (*width);
    int column = (threadidx+3) % (*height);
    output[column * (*height) + row] = input[threadidx];
}

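Above is my linear transformation kernel. For an input matrix of [0, 1, 2, 3, 4, 5, 6, 7, 8], the output matrix should be [0, 3, 6, 1, 4, 7, 2, 5, 8], but when I run this code with that example the output is [0, 3, 6, 0, 0, 0, 0, 0, 0]. I have written a serial implementation of the algorithm in Python, and it works. The only thing I can think of is some kind of thread memory access issue. Any help? Thanks.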

score 1 · Accepted Answer

As the comments have already pointed out, your code happens to work for the example input you gave:

[0, 1, 2, 3, 4, 5, 6, 7, 8]

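If you are not getting the result you expect for that input, [0, 3, 6, 1, 4, 7, 2, 5, 8], then the error lies outside the code you have shown. However, you appear to be trying to transpose an array.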

This code does not work in the general case (try, for example, the 2x2 array [0, 1, 2, 3]).
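
To see the failure concretely without a GPU, the kernel's index arithmetic can be replayed on the host. The following is a minimal C++ sketch, not the asker's code: `emulate_original_kernel` is a hypothetical helper that walks the thread indices serially, and it assumes one thread per element and a zero-initialised output buffer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper: replays the index arithmetic of the
// question's kernel, one loop iteration per CUDA thread.
// Assumes one thread per element and an output buffer that starts at zero.
std::vector<double> emulate_original_kernel(const std::vector<double>& input,
                                            int width, int height)
{
    const int n = width * height;
    assert(static_cast<int>(input.size()) == n);

    std::vector<double> output(n, 0.0);
    for (int threadidx = 0; threadidx < n; ++threadidx) {
        int row = threadidx / width;
        int column = (threadidx + 3) % height;  // the line under discussion
        int idx = column * height + row;
        // For non-square shapes this index can exceed the buffer; the real
        // kernel would write out of bounds, so the emulation skips the write.
        if (idx < n) {
            output[idx] = input[threadidx];
        }
    }
    return output;
}
```

Under these assumptions, `emulate_original_kernel({0, 1, 2, 3}, 2, 2)` yields [1, 3, 0, 2], whereas the transpose of that 2x2 array is [0, 2, 1, 3]. The 3x3 example from the question still comes out as [0, 3, 6, 1, 4, 7, 2, 5, 8], because (threadidx + 3) % 3 equals threadidx % 3.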

If your intent is to transpose the array, this line of code in particular is incorrect:

    int column = (threadidx+3) % (*height);

If you change it to:

    int column = (threadidx) % (*width);

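your code will produce correct transposed results for a variety of matrix sizes. With width as the number of columns, a row-major input index decomposes as row * width + column, and the transposed element belongs at column * height + row, which is exactly what the kernel then writes.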
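
The corrected mapping can be checked on the host by replaying the index arithmetic serially. This is a sketch under assumptions rather than the answer's own code: `emulate_corrected_kernel` is a hypothetical helper, the matrix is taken to be row-major with `width` columns and `height` rows, and one thread per element is assumed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper: replays the kernel's index arithmetic with
// the corrected column computation, one loop iteration per CUDA thread.
// Assumes a row-major input with `width` columns and `height` rows.
std::vector<double> emulate_corrected_kernel(const std::vector<double>& input,
                                             int width, int height)
{
    const int n = width * height;
    assert(static_cast<int>(input.size()) == n);

    std::vector<double> output(n, 0.0);
    // The loop bound stands in for a guard such as
    // `if (threadidx < width * height)` inside a real kernel.
    for (int threadidx = 0; threadidx < n; ++threadidx) {
        int row = threadidx / width;      // row in the input matrix
        int column = threadidx % width;   // corrected line from the answer
        // The transposed matrix has `height` columns, so element
        // (row, column) lands at (column, row) = column * height + row.
        output[column * height + row] = input[threadidx];
    }
    return output;
}
```

For example, a 2-row by 3-column input [0, 1, 2, 3, 4, 5] (width 3, height 2) comes back as [0, 3, 1, 4, 2, 5], and the 2x2 array [0, 1, 2, 3] comes back as [0, 2, 1, 3]. In an actual launch the grid is often rounded up to whole blocks, so a bounds check on `threadidx` inside the kernel is a common precaution.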
