opencl - OpenCL out-of-bounds error

Question

This kernel works fine:

__kernel void test(__global float* a_Direction, __global float* a_Output, const unsigned int a_Count)
{
    int index = get_global_id(0);

    if (index < a_Count)
    {
        a_Output[index * 3 + 0] = a_Direction[index * 3 + 0] * 0.5f + 0.5f;
        a_Output[index * 3 + 1] = a_Direction[index * 3 + 1] * 0.5f + 0.5f;
        a_Output[index * 3 + 2] = a_Direction[index * 3 + 2] * 0.5f + 0.5f;
    }
}

This kernel produces an out-of-bounds error:

__kernel void test(__global float3* a_Direction, __global float3* a_Output, const unsigned int a_Count)
{
    int index = get_global_id(0);

    if (index < a_Count)
    {
        a_Output[index].x = a_Direction[index].x * 0.5f + 0.5f;
        a_Output[index].y = a_Direction[index].y * 0.5f + 0.5f;
        a_Output[index].z = a_Direction[index].z * 0.5f + 0.5f;
    }
}

To me, both kernels look like they should do the same thing, yet only one of the two works. Am I missing something obvious?

The exact error is: "CL_OUT_OF_RESOURCES error executing CL_COMMAND_READ_BUFFER on GeForce GTX580M (Device 0)."

score 2 · Accepted Answer

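@arsenm, in his/her answer, and @Darkzeros both gave the correct explanation, but I think it is worth expanding on a little. The problem is that the second kernel involves a "hidden" alignment. As the standard states in section 6.1.5: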

For 3-component vector data types, the size of the data type is 4 * sizeof(component). This means that a 3-component vector data type will be aligned to a 4 * sizeof(component) boundary.
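To make the quoted rule concrete, here is a minimal host-side sketch in plain C11. The type padded_float3 and the two helper functions are hypothetical names made up for illustration; this is not the cl_float3 typedef from the OpenCL headers, only a struct forced to the same 16-byte alignment so that the padding becomes visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-side float3: three floats aligned to
   a 4 * sizeof(float) boundary. The compiler pads the struct up to that
   alignment, so it occupies the space of four floats. */
typedef struct {
    _Alignas(4 * sizeof(float)) float s[3];
} padded_float3;

/* Distance in bytes between consecutive elements of a tightly packed
   x, y, z float array (what the first kernel assumes). */
size_t packed_stride_bytes(void)
{
    return 3 * sizeof(float);
}

/* Distance in bytes between consecutive elements of a padded_float3 array
   (what the float3 kernel assumes). */
size_t padded_stride_bytes(void)
{
    assert(sizeof(padded_float3) % sizeof(float) == 0);
    return sizeof(padded_float3);
}
```

With a 4-byte float, the packed stride is 12 bytes and the padded stride is 16 bytes, which is the mismatch the rest of this answer walks through.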

Let's illustrate that with an example:

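Assume that a_Direction holds 9 floats and that you use 3 threads/work-items to process them. With the first kernel there is no problem: thread 0 handles the elements at indices 0, 1, 2, thread 1 handles elements 3, 4, 5, and thread 2 handles elements 6, 7, 8. Everything is fine.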

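With the second kernel, however, assuming the data structure stays the same from the host's point of view (i.e. an array indexed from 0 to 8), thread 0 will handle elements 0, 1, 2 (and will also touch element 3, because a float3 vector is laid out like a float4 vector, without doing anything with it). The second thread, thread 1, will not access elements 3, 4, 5; it will access elements 4, 5, 6 (and 7, again without doing anything with it).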

So this is where the problem arises: thread 2 will try to access elements 8, 9, 10 (and 11), which is out of bounds for a 9-float buffer.
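The index arithmetic in this example can be checked with a small plain C sketch. The function names are hypothetical and nothing here calls OpenCL; a stride of 3 models the tightly packed float array and a stride of 4 models the float3 layout described above.

```c
#include <assert.h>
#include <stddef.h>

/* First float index touched by work-item `index` when consecutive
   elements are `stride` floats apart (3 = packed floats, 4 = float3). */
size_t first_float_index(size_t index, size_t stride)
{
    return index * stride;
}

/* Returns 1 if the whole element of work-item `index` lies inside a buffer
   of `float_count` floats, 0 if it reaches past the end of the buffer. */
int access_in_bounds(size_t index, size_t stride, size_t float_count)
{
    assert(stride > 0);
    return first_float_index(index, stride) + stride <= float_count;
}
```

For the 9-float buffer, work-item 2 is in bounds with a stride of 3 (floats 6 to 8) but out of bounds with a stride of 4 (floats 8 to 11), matching the walkthrough.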

In short, as far as size and alignment are concerned, a 3-element vector behaves like a 4-element vector.

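Now, if you want to use vectors without changing the data structure on the host side, you can use the vload3 and vstore3 functions described in the standard's section on vector data load and store functions (6.12.7 in OpenCL 1.2). They take pointers to the scalar type, so the kernel arguments stay declared as __global float*, as in the first kernel. Like this: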

vstore3(vload3(index, a_Direction) * 0.5f + 0.5f, index, a_Output);
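As a rough illustration of why this works with the unchanged host data: vload3 and vstore3 address element index at p + index * 3, with no padding float. The plain C function below is a host-side emulation of that addressing under this assumption; scale_bias_packed is a hypothetical name, and the code does not call the OpenCL built-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of the packed addressing used by vload3/vstore3:
   element `index` occupies floats [index * 3, index * 3 + 2]. */
void scale_bias_packed(const float *direction, float *output, size_t count)
{
    assert(direction != NULL && output != NULL);
    for (size_t index = 0; index < count; ++index) {
        const float *src = direction + index * 3; /* like vload3(index, direction) */
        float *dst = output + index * 3;          /* like vstore3(value, index, output) */
        for (size_t c = 0; c < 3; ++c) {
            dst[c] = src[c] * 0.5f + 0.5f;        /* same scale and bias as the kernel */
        }
    }
}
```

Each iteration of the outer loop plays the role of one work-item, and the last one touches floats count * 3 - 3 to count * 3 - 1, so a buffer of count * 3 floats is enough.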

By the way, assuming the data is correctly aligned, you don't need to bother with statements like:

a_Output[index].x = a_Direction[index].x * 0.5f + 0.5f;
a_Output[index].y = a_Direction[index].y * 0.5f + 0.5f;
a_Output[index].z = a_Direction[index].z * 0.5f + 0.5f;

This single statement is enough (there is no need to write one line per component):

a_Output[index] = a_Direction[index] * 0.5f + 0.5f;

score 2 · Accepted Answer

The problem you are probably running into is that you allocated a buffer of n * 3 * sizeof(float) bytes for the float3s, but the size and alignment of float3 is 16 bytes, not 12.
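If you would rather keep the float3 kernel, one possible host-side fix is to size the buffer for 4 floats per element and repack the data before uploading it. The sketch below is plain C with hypothetical helper names (float3_buffer_bytes, pad_triples); it assumes the 16-byte float3 size described above and leaves the actual buffer creation and transfer calls out.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `count` float3 elements, assuming each element
   occupies 4 * sizeof(float) bytes on the device. */
size_t float3_buffer_bytes(size_t count)
{
    return count * 4 * sizeof(float);
}

/* Copies `count` packed x, y, z triples into a layout with 4 floats per
   element, zeroing the padding float. `padded` must hold count * 4 floats. */
void pad_triples(const float *packed, float *padded, size_t count)
{
    assert(packed != NULL && padded != NULL);
    for (size_t i = 0; i < count; ++i) {
        padded[i * 4 + 0] = packed[i * 3 + 0];
        padded[i * 4 + 1] = packed[i * 3 + 1];
        padded[i * 4 + 2] = packed[i * 3 + 2];
        padded[i * 4 + 3] = 0.0f; /* padding, ignored by the kernel */
    }
}
```

The output read back from the kernel would then have the same 4-floats-per-element layout and would need the reverse copy on the host.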
