opencl - How do I use local memory in OpenCL?

Question

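I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel into something that uses local memory. How would you do it? Here is the original kernel source: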

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to use local memory, any other simple example will do.

score 32 · Accepted Answer

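Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.

Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter: on the host, a __local pointer argument is set with clSetKernelArg by giving the buffer size in bytes and a NULL value.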

__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
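        // If work-items read values written by other work-items, a
        // barrier(CLK_LOCAL_MEM_FENCE) would be needed between the write and
        // the read. Every work-item in the group must reach that barrier, so
        // it would have to sit outside this if block.
        output[gtid] = temp[ltid] * temp[ltid];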
    }
}
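
To show what the barrier comment is getting at, here is a minimal sketch of a kernel in which each work-item does read a value staged by a different work-item. The kernel name reverse_chunks and the parameter name tile are made up for illustration, and the sketch assumes count is a multiple of the work-group size, so that every chunk is full.

```opencl
// Hypothetical kernel: reverses the elements inside each work-group's chunk.
// Assumes count is a multiple of the work-group size (see the note above).
__kernel void reverse_chunks(
    __global const float *input,
    __global float *output,
    __local float *tile,          // one float per work-item in the group
    const unsigned int count)
{
    size_t gid   = get_global_id(0);
    size_t lid   = get_local_id(0);
    size_t lsize = get_local_size(0);
    // With no global offset: gid == get_group_id(0) * lsize + lid.

    // Phase 1: each work-item stages one element of its group's chunk.
    // Out-of-range work-items store a padding value instead of skipping,
    // so that every work-item goes on to reach the barrier.
    tile[lid] = (gid < count) ? input[gid] : 0.0f;

    // Wait until the whole chunk is in local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Phase 2: read the slot written by the mirrored work-item.
    if (gid < count)
        output[gid] = tile[lsize - 1 - lid];
}
```

On the host, the tile argument would be sized with something like clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL), where kernel and local_size stand for whatever your host code uses.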

score 30 · Accepted Answer

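If the size of the local memory is constant, there is another possibility. Instead of using a pointer in the kernel's parameter list, you can declare the local buffer inside the kernel, simply by declaring it __local: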

__local float localBuffer[1024];

This saves some host code, because one fewer clSetKernelArg call is needed.
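
As a rough sketch, the squaring kernel from the question could look like this with a fixed-size local buffer. The names TILE_SIZE and square_fixed_tile are illustrative, and 256 is only an assumed work-group size; the reqd_work_group_size attribute asks the runtime to reject launches with any other local size, so a work-item can never index past the end of the buffer.

```opencl
#define TILE_SIZE 256   // assumed work-group size; the host must launch with it

__kernel __attribute__((reqd_work_group_size(TILE_SIZE, 1, 1)))
void square_fixed_tile(
    __global const float *input,
    __global float *output,
    const unsigned int count)
{
    // Local buffer with a compile-time size: no extra kernel argument needed.
    __local float tile[TILE_SIZE];

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    if (gid < count) {
        tile[lid] = input[gid];
        // Each work-item reads back only its own slot, so no barrier is needed.
        output[gid] = tile[lid] * tile[lid];
    }
}
```

The trade-off is flexibility: with a pointer argument the host can pick the buffer size at run time, while here it is baked into the kernel source.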

score 5 · Accepted Answer

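In OpenCL, local memory is meant for sharing data among all the work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data written by other work-items). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read or written repeatedly. Bank conflicts should be avoided as much as possible.

If you are not careful with local memory, you may sometimes end up with worse performance than with global memory.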
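
A work-group reduction is one common case where these points come together: the same local data is read and written several times, and a barrier separates each step. The following is only a sketch under stated assumptions: the names sum_squares_partial, partial_sums and scratch are invented here, and the work-group size is assumed to be a power of two.

```opencl
// Hypothetical kernel: writes one partial sum of squares per work-group.
// Assumes the work-group size is a power of two.
__kernel void sum_squares_partial(
    __global const float *input,
    __global float *partial_sums,   // one element per work-group
    __local float *scratch,         // one float per work-item in the group
    const unsigned int count)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Stage the squared input; out-of-range work-items contribute 0.
    float x = (gid < count) ? input[gid] : 0.0f;
    scratch[lid] = x * x;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work-items each step.
    // The barrier is outside the if, so every work-item reaches it.
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 publishes the group's result.
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}
```

The host (or a second kernel pass) would still have to add up the per-group results. Having neighbouring work-items touch neighbouring scratch slots, as above, is the usual way to keep bank conflicts low, though the exact behaviour depends on the device.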
