profiling - Confused by the profiling results for my OpenCL kernel (DSP window function)

Question

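I've completed a window function kernel in OpenCL. Basically, a window function just applies a set of coefficients to another set of numbers, element by element (Wikipedia explains it better). In most cases I was able to fit the float array of window coefficients into the constant cache.

I expected the Compute Prof results to show the host-to-device and device-to-host memory transfers taking more than 95% of the processing time. In nearly all of my cases, they take only 80% of the processing time. I am writing one 4.2-million-float array to the board and reading it back, and writing another float array that usually stays well below 1 million elements.

Does anything in the kernel look fishy? I'd also welcome opinions on whether this is a problem that should run faster on a GPU than on a CPU in the first place (I'm still not 100% sure about that). I'm a little surprised that my gld_efficiency and gst_efficiency hover between 0.1 and 0.2. I wrote this kernel with G80 global memory coalescing in mind. My overall global memory throughput seems fine at 40 GB/s. The kernel is quite simple and is posted below.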

__kernel void window(__global float* inputArray, // first frame to ingest starts at 0.  Sized to nFramesToIngest*framesize samples
    __constant float* windowArray, // may already be partly filled
    int windowSize, // size of window frame, in floats
    int primitivesPerDataFrame, //amount of primitives in each frame of inputArray parameter
    int nInFramesThisCall, //each thread solves a frame, so this integer represent how many threads this kernel launches
    int isRealNumbers //0 for complex, non-zero for real 
)
{
    // Flatten the 2D global ID into a single frame index.
    int gid = get_global_id(0) + get_global_size(0) * get_global_id(1);

    if (gid < nInFramesThisCall) // make sure we don't execute unnecessary threads
    {
        if (isRealNumbers)
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[i % windowSize];
            }
        }
        else // complex
        {
            for (int i = 0; i < primitivesPerDataFrame; i++)
            {
                int inputArrayIndex = (gid * primitivesPerDataFrame) + i;
                // % and / have equal precedence and associate left to right
                inputArray[inputArrayIndex] = inputArray[inputArrayIndex] * windowArray[(i % windowSize) / 2];
            }
        }
    }
}

score 1 · Accepted Answer

How many threads are you using (the OpenCL term is work-items, by the way)? You need at least several hundred to load a large GPU efficiently.

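You say you want to make use of coalesced memory access, but a load with an offset like

int inputArrayIndex = (gid*primitivesPerDataFrame)+i;

will not make that possible in most cases. NVidia's G80 has fairly strict restrictions when it comes to coalescing; see the "OpenCL Best Practices Guide" for more information. Basically, the work-items of one warp have to access the elements of a 64- or 128-byte aligned block in a particular pattern at the same time for loads and stores to be coalesced.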

To give an example: if primitivesPerDataFrame is 16, the loads and stores of a warp are done at offsets spaced 16 elements apart, which makes any efficient coalescing impossible.
