c++ - OpenCL FFT implementation - nonsensical output data - presumably correct algorithm

Question

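I've been having some trouble creating an FFT signal cross-correlation module (making use of the circular convolution theorem, etc.). I'd just like to confirm that the following scheme ensures that a particular recursion level of the FFT butterfly computations finishes before the next one starts, and that the buffers containing the data are fully written to/completed. The circular correlation/convolution involves an FFT, a vector-wise inner product, and then an IFFT.

Because of this scheme, I have no kernel that sorts the data into bit-reversed index order. The forward FFT kernel produces a bit-reversed-order FFT, and after the inner products the IFFT just uses that result to compute a natural-order solution.

I should mention that I have multiple GPUs.

Anyway, here is a pseudocode representation of what happens for every FFT/IFFT (the access/operation algorithms are equivalent, save for the conjugated twiddle factors; the normalization kernel comes later):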

    for numDevices:
        data -> buffers
        buffers -> kernel arguments

    for fftRecursionLevels:
        for numDevices:
            recursionLevel -> kernel arguments
            deviceCommandQueue -> enqueueNDRangeKernel
            deviceCommandQueue ->  flush()

        for numDevices:
            deviceCommandQueue -> finish()

(EDIT: The method is Radix-2 DIT, sorry if that wasn't clear.)

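Can I get away with that? As far as I understand, finish() is a blocking function, so that very last for loop won't complete until every kernel has finished computing across its global range (here fftSize / 2; see any literature on Radix-2 butterfly operations). And, for bonus points, some of the kernels are already executing thanks to flush() while I'm enqueuing the remaining ones.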

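Overall, I'm getting some funky/garbage results using OpenCL/C++ for this particular piece of software. I've implemented the full data pipeline in Python (the algorithm is "topologically equivalent", if you will; obviously there are no host<-->device buffers/instructions or device-side operations with the Python method) and simulated how the kernels should run, and it produces IDENTICAL results to using the scipy.fftpack module directly on the signal data vectors.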
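The same kind of host-side simulation can also be sketched in C++, which keeps the check in the language of the host program. This is only a sketch under assumptions: the function names are hypothetical, the butterfly is written with `std::complex` multiplication (the mathematically intended a ± w·b, not a literal copy of the device code), and the level/index arithmetic follows the `fftForwKernel` and `bitRevKernel` listed further down.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cfloat = std::complex<float>;

// Reverse the lowest `depth` bits of `index` (same role as BRIR in the kernel).
std::uint32_t bitReverse(std::uint32_t index, unsigned depth) {
    std::uint32_t rev = 0;
    for (unsigned b = 0; b < depth; ++b) {
        rev = (rev << 1) | ((index >> b) & 1u);
    }
    return rev;
}

// Serial stand-in for the device pipeline: one pass per recursion level,
// N/2 butterflies per pass, then the bit-reversal copy into natural order.
std::vector<cfloat> emulateForwardFft(std::vector<cfloat> data, unsigned depth) {
    const std::uint32_t n = 1u << depth;
    assert(data.size() == n);
    const double twoPi = 2.0 * std::acos(-1.0);
    for (unsigned level = 1; level <= depth; ++level) {
        const std::uint32_t gap = 1u << (level - 1);
        const std::uint32_t group = 1u << level;
        for (std::uint32_t id = 0; id < n / 2; ++id) {  // one "work-item"
            const std::uint32_t offset = id & (gap - 1);
            const std::uint32_t posA = (id >> (level - 1)) * group + offset;
            const std::uint32_t a = bitReverse(posA, depth);
            const std::uint32_t b = bitReverse(posA + gap, depth);
            const double phi = -twoPi * offset / group;  // forward twiddle
            const cfloat w(static_cast<float>(std::cos(phi)),
                           static_cast<float>(std::sin(phi)));
            const cfloat wb = w * data[b];
            data[b] = data[a] - wb;
            data[a] = data[a] + wb;
        }
    }
    std::vector<cfloat> natural(n);
    for (std::uint32_t i = 0; i < n; ++i) natural[bitReverse(i, depth)] = data[i];
    return natural;
}

// O(N^2) reference DFT, accumulated in double precision.
std::vector<cfloat> naiveDft(const std::vector<cfloat>& x) {
    const std::size_t n = x.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<cfloat> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double phi = -twoPi * static_cast<double>((k * j) % n) / static_cast<double>(n);
            acc += std::complex<double>(x[j].real(), x[j].imag()) *
                   std::complex<double>(std::cos(phi), std::sin(phi));
        }
        out[k] = cfloat(static_cast<float>(acc.real()), static_cast<float>(acc.imag()));
    }
    return out;
}

// Largest element-wise distance between two equally sized spectra.
float maxAbsError(const std::vector<cfloat>& a, const std::vector<cfloat>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) worst = std::max(worst, std::abs(a[i] - b[i]));
    return worst;
}
```

Comparing `emulateForwardFft(x, depth)` with `naiveDft(x)` through `maxAbsError` gives a single number to track; if the host emulation agrees with the reference while the device output does not, the difference is more likely in the device code or its launch than in the index scheme.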

I guess some pictures would help. Here's exactly what is happening in both programs.

1) Generate Gaussian vectors
2) Zero-pad the Gaussian vectors to the next highest power-of-2 length
3) Forward FFT, producing a natural-order (in w) result
4) Plot
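Steps 1 and 2 might look roughly like this on the host. The original generation code isn't shown, so everything here is an assumption: the helper names, the width parameter `sigma`, and the choice to pad to the smallest power of two that is at least 2N (which gives 2N when N is itself a power of two, as in the case described below).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Smallest power of two that is >= 2 * n (assumed padding rule, see lead-in).
std::size_t paddedLength(std::size_t n) {
    std::size_t p = 1;
    while (p < 2 * n) p <<= 1;
    return p;
}

// Gaussian sampled at 0..n-1 and centred on n/2; `sigma` is a hypothetical width.
std::vector<cfloat> gaussian(std::size_t n, float sigma) {
    std::vector<cfloat> v(n);
    const float mu = 0.5f * static_cast<float>(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float t = (static_cast<float>(i) - mu) / sigma;
        v[i] = cfloat(std::exp(-0.5f * t * t), 0.0f);
    }
    return v;
}

// Append zeros until the vector has `paddedLen` elements.
std::vector<cfloat> zeroPad(std::vector<cfloat> v, std::size_t paddedLen) {
    v.resize(paddedLen);  // new elements are value-initialised to (0, 0)
    return v;
}
```

For example, `zeroPad(gaussian(1024, 100.0f), paddedLength(1024))` yields a 2048-element vector whose second half is all zeros.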

Here is the Python simulation of my kernel, compared with just using scipy.fftpack.fft(vector):

http://i.imgur.com/pGcYTrL.png

They're the same. Now compare that with either of these:

http://i.imgur.com/pbiYGpR.png

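(Disregard the indices on the x-axes; they're all natural-order FFT results.)

Both start from the same type of data (Gaussians going from 0 to N, centered on N/2, and zero-padded to 2N in this case), and both should look like the green/blue lines in the first picture, but they don't. My eyes have glazed over from staring at the host/device code of that second program for so long, and I don't see any typos or incorrect algorithms. I strongly suspect something is going on device-side that I'm unaware of, hence my posting here. The algorithm clearly LOOKS like it's operating properly (the general shape of the red curves approximates that of the blue/green ones no matter the starting data; I've run it on different starting sets and it consistently looks like the blue/green but with that nonsensical noise/error), but something is awry.

So I turn to the interwebs. Thanks in advance.

EDIT: One poster below suggested that it's difficult to comment without seeing at least the device-side code, since there are questions about memory fencing, so I've posted the kernel code below.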

//fftCorr.cl
//
//OpenCL Kernels/Methods for FFT Cross Correlation of Signals
//
//Copyright (C) 2013  Steve Novakov
//   
//This file is part of OCLSIGPACK.
//
//OCLSIGPACK is free software: you can redistribute it and/or modify
//it under the terms of the GNU General Public License as published by
//the Free Software Foundation, either version 3 of the License, or
//(at your option) any later version.
//
//OCLSIGPACK is distributed in the hope that it will be useful,
//but WITHOUT ANY WARRANTY; without even the implied warranty of
//MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
//GNU General Public License for more details.
//
//You should have received a copy of the GNU General Public License
//along with OCLSIGPACK.  If not, see <http://www.gnu.org/licenses/>.
//


#define PIE 3.14159265359f


void DFT2(float2 * a,float2 * b, float2 *w){
    float2 tmp; 
    float2 bmul = ( (*w).x*((*b).x) - (*w).y*((*b).y),  (*w).x*((*b).y) + (*w).y*((*b).x) );
    tmp = (*a) - bmul; 
    (*a) += bmul; 
    (*b) = tmp; 
}
//
//
// Spin Factor Calc
// 
// Computes spin/twiddle factor for particular bit reversed index.
//
//

float2 spinFact(unsigned int N, unsigned int k)
{
    float phi = -2.0 * PIE * (float) k / (float)  N;
    // \bar{w}^offset_(groupDist)   

    float spinRe, spinIm;
    spinIm = sincos( phi, &spinRe); 

    return (float2) (spinRe, spinIm);   
}


float2 spinFactR(unsigned int N, unsigned int k)
{
    float phi = 2.0 * PIE * (float) k / (float)  N;
    // w^offset_(groupDist) 

    float spinRe, spinIm;
    spinIm = sincos( phi, &spinRe); 

    return (float2) (spinRe, spinIm);   
}


//
// Bit-Reversed Index Reversal, (that sounds confusing)
//
unsigned int BRIR( unsigned int index, unsigned int fftDepth)
{
    unsigned int rev = index;   

    rev = (((rev & 0xaaaaaaaa) >> 1 ) | ((rev & 0x55555555) << 1 ));
    rev = (((rev & 0xcccccccc) >> 2 ) | ((rev & 0x33333333) << 2 ));
    rev = (((rev & 0xf0f0f0f0) >> 4 ) | ((rev & 0x0f0f0f0f) << 4 ));
    rev = (((rev & 0xff00ff00) >> 8 ) | ((rev & 0x00ff00ff) << 8 ));
    rev = (((rev & 0xffff0000) >> 16) | ((rev & 0x0000ffff) << 16));

    rev >>= (32-fftDepth);  

    return rev;
}


//
//
// Index Bit Reversal Kernel, if Necessary/for Testing.
//
// Maybe I should figure out an in-place swap algorithm later.
//
//
__kernel void bitRevKernel(     __global float2 * fftSetX,
                                __global float2 * fftSetY,
                                __global float2 * fftRevX,
                                __global float2 * fftRevY, 
                                unsigned int fftDepth
                            )
{
    unsigned int glID = get_global_id(0);

    unsigned int revID = BRIR(glID, fftDepth);

    fftRevX[revID] = fftSetX[glID];
    fftRevY[revID] = fftSetY[glID];

}


//
//
// FFT Radix-2 Butterfly Operation Kernel
//
// This is an IN-PLACE algorithm. It calculates both bit-reversed indices and spin factors in the same thread and
// updates the original set of data with the "butterfly" results.
// 
// recursionLevel is the level of recursion of the butterfly operation 
// # of threads is half the vector size N/2, (glID is between 0 and this value, non inclusive)
//
// Assumes natural order data input. Produces bit-reversed order FFT output.
//
//
__kernel void fftForwKernel(    __global float2 * fftSetX,
                                __global float2 * fftSetY,
                                unsigned int recursionLevel,
                                unsigned int totalDepth
                            )
{

    unsigned int glID = get_global_id(0);

    unsigned int gapSize = 1 << (recursionLevel - 1);
    unsigned int groupSize = 1 << recursionLevel;
    unsigned int base = (glID >> (recursionLevel - 1)) * groupSize;
    unsigned int offset = glID & (gapSize - 1 );

    unsigned int bitRevIdA = (unsigned int) base + offset;
    unsigned int bitRevIdB = (unsigned int) bitRevIdA + gapSize;

    unsigned int actualIdA = BRIR(bitRevIdA, totalDepth);
    unsigned int actualIdB = BRIR(bitRevIdB, totalDepth);


    float2 tempXA = fftSetX[actualIdA];
    float2 tempXB = fftSetX[actualIdB];

    float2 tempYA = fftSetY[actualIdA];
    float2 tempYB = fftSetY[actualIdB];                 

    float2 spinF = spinFact(groupSize, offset);

    // size 2 DFT   
    DFT2(&tempXA, &tempXB, &spinF); 
    DFT2(&tempYA, &tempYB, &spinF);


    fftSetX[actualIdA] = tempXA;
    fftSetX[actualIdB] = tempXB;

    fftSetY[actualIdA] = tempYA;
    fftSetY[actualIdB] = tempYB;    

}

For the data shown in the pictures, I run "fftForwKernel" as described at the beginning of the post, then run "bitRevKernel".
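One detail in the listing may be worth checking, offered as a guess and not verified on a device. The initialiser of `bmul` in `DFT2` is a bare parenthesised pair, whereas `spinFact` builds its result as `(float2) (spinRe, spinIm)`. As far as I understand OpenCL C, a vector literal needs the type in front; without it the comma is the ordinary comma operator, only the second expression survives, and that scalar is then widened into both components. The C++ sketch below imitates that behaviour with a hypothetical `Float2` stand-in (its one-argument constructor plays the role of the scalar widening); the helper names are made up for illustration.

```cpp
#include <cassert>

// Minimal stand-in for OpenCL's float2. The one-argument constructor mimics
// the scalar-to-vector widening assumed in the lead-in.
struct Float2 {
    float x, y;
    Float2(float s) : x(s), y(s) {}
    Float2(float x_, float y_) : x(x_), y(y_) {}
};

// Real and imaginary parts of the complex product w * b.
float mulRe(Float2 w, Float2 b) { return w.x * b.x - w.y * b.y; }
float mulIm(Float2 w, Float2 b) { return w.x * b.y + w.y * b.x; }

// Same shape as the initialiser in DFT2: parentheses with no type in front,
// so the comma operator discards the real part and keeps only the last value.
Float2 productWithoutCast(Float2 w, Float2 b) {
    Float2 bmul = (mulRe(w, b), mulIm(w, b));
    return bmul;
}

// Explicit construction, analogous to writing (float2)(re, im) in OpenCL C.
Float2 productWithCast(Float2 w, Float2 b) {
    return Float2(mulRe(w, b), mulIm(w, b));
}
```

With w = i and b = 2 + 3i, the explicit version gives (-3, 2), while the version without the type gives (2, 2). If the device code behaves the same way, every butterfly would use a wrong twiddle product, which might explain output that has roughly the right shape but is noisy.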

score 1 · Accepted Answer

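So, without code to go on, and operating under the assumption that the algorithms used in Python and OpenCL really are the same, I'm inclined to say this is probably a synchronization issue. If it's not a global synchronization issue (splitting the work properly between kernel calls), then it's a lack of global memory fences, or even local memory fences, per local group per kernel call.

How are you organizing the calls? The pseudocode provided seems ambiguous about exactly how you split up the recursive FFT. I'm guessing you're doing the "right thing" for... well, I'm not even sure whether you're doing DIT or DIF or any of the huge variety of data flows available for FFT algorithms. For all I know, it could be that you're doing butterflies without memory-fencing them properly, or you might be synchronizing your FFT steps so strictly that the butterflies are part of an entirely different kernel call from the recursive FFT steps, in which case my suggestions are totally null and void.

(I would have made this a comment, but I lack the reputation to do so, so my apologies for posting it as an "answer".)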
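Whether fences inside a kernel call matter depends on whether two work-items of the same level ever touch the same element. For the index arithmetic in the `fftForwKernel` that the asker posted in the question's edit, that can be checked on the host. The sketch below is an illustration under assumptions (the names `pairFor` and `levelIsRaceFree` are hypothetical): it counts how often each position is touched at one level. The positions are the pre-reversal ones; since bit reversal is a one-to-one mapping, the result carries over to the actual buffer indices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pair of (pre-bit-reversal) positions handled by one work-item.
struct ButterflyPair {
    std::uint32_t a;
    std::uint32_t b;
};

// Same arithmetic as fftForwKernel: gapSize, groupSize, base, offset.
ButterflyPair pairFor(std::uint32_t id, unsigned level) {
    const std::uint32_t gap = 1u << (level - 1);
    const std::uint32_t group = 1u << level;
    const std::uint32_t base = (id >> (level - 1)) * group;
    const std::uint32_t offset = id & (gap - 1);
    return {base + offset, base + offset + gap};
}

// True if the N/2 work-items of this level touch each of the N positions
// exactly once, i.e. no two work-items share an element within the level.
bool levelIsRaceFree(unsigned level, unsigned depth) {
    const std::uint32_t n = 1u << depth;
    std::vector<int> touched(n, 0);
    for (std::uint32_t id = 0; id < n / 2; ++id) {
        const ButterflyPair p = pairFor(id, level);
        ++touched[p.a];
        ++touched[p.b];
    }
    for (int count : touched) {
        if (count != 1) return false;
    }
    return true;
}
```

If this holds for every level, as it appears to for this indexing, then the only ordering requirement is between levels, which is what one kernel launch per level followed by finish() on each queue is meant to provide.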
