cuda - 为什么 cuFFT 性能会因输入重叠而受到影响？

Question

我正在尝试使用 cuFFT 的回调功能即时执行输入格式转换（例如，计算 8 位整数输入数据的 FFT，而无需先将输入缓冲区显式转换为float. 在我的许多应用程序中，我需要计算输入缓冲区上的重叠FFT，如上一个 SO 问题中所述。通常，相邻的 FFT 可能重叠 FFT 长度的 1/4 到 1/8。

cuFFT 具有类似 FFTW 的接口，通过函数的idist参数cufftPlanMany()明确支持这一点。具体来说，如果我想计算大小为 32768 且连续输入之间有 4096 个样本重叠的 FFT，我将设置idist = 32768 - 4096. 从产生正确输出的意义上说，这确实可以正常工作。

但是，以这种方式使用 cuFFT 时，我看到了奇怪的性能下降。我设计了一个测试，以两种不同的方式实现这种格式转换和重叠：

明确告诉 cuFFT 输入的重叠性质：idist = nfft - overlap如上所述设置。安装一个加载回调函数，它只是根据需要在提供给回调的缓冲区索引上执行从int8_t到的转换。float
不要告诉 cuFFT 输入的重叠性质；骗它一个 dset idist = nfft。然后，让回调函数通过计算应该为每个 FFT 输入读取的正确索引来处理重叠。

此 GitHub gist 中提供了一个通过时序和等效测试实现这两种方法的测试程序。为简洁起见，我没有在这里全部复制。该程序计算了一批 1024 个 32768 点的 FFT，它们重叠了 4096 个样本；输入数据类型是 8 位整数。当我在我的机器上运行它时（使用 Geforce GTX 660 GPU，在 Ubuntu 16.04 上使用 CUDA 8.0 RC），我得到以下结果：

executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec

方法 2 明显更快，这是我没想到的。查看回调函数的实现：

方法一：

template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index, 
    void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}

方法二：

template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf, 
    size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking 
    // the overlap into account. Convert it to the appropriate sample 
    // index, considering the overlap structure. First, grab the FFT 
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer 
    // arithmetic here for speed, knowing that this would cause a problem 
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap" 
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}

方法 2 有一个复杂得多的回调函数，甚至涉及到整数除以非编译时间值！我希望这比方法 1 慢得多，但我看到的是相反的情况。对此有很好的解释吗？当输入重叠时，cuFFT 的处理结构是否可能大不相同，从而导致性能下降？

如果可以从回调中删除索引计算（但这需要将重叠指定为 cuFFT），我似乎应该能够实现比方法 2 快得多的性能。

编辑：在下运行我的测试程序后nvvp，我可以看到 cuFFT 显然似乎以不同的方式构建其计算。很难理解内核符号名称，但内核调用分解如下：

方法一：

__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>：12.75 毫秒（是的，它被调用了两次）
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 毫秒

方法二：

spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 毫秒
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 毫秒

有趣的是，看起来 cuFFT 调用两个内核来实际使用方法 1 计算 FFT（当 cuFFT 知道重叠时），但使用方法 2（它不知道 FFT 重叠），它只用一。对于在这两种情况下使用的内核，它似乎在方法 1 和 2 之间使用了相同的网格参数。

我不明白为什么它应该在这里使用不同的实现，尤其是在输入 stride 之后istride == 1。在转换输入处获取数据时，它应该只使用不同的基地址；我认为算法的其余部分应该完全相同。

编辑2：我看到一些更奇怪的行为。我偶然意识到，如果我未能适当地破坏 cuFFT 手柄，我会看到测量性能的差异。例如，我修改了测试程序以跳过破坏 cuFFT 句柄，然后以不同的顺序执行测试：方法 1、方法 2、然后方法 2 和方法 1。我得到以下结果：

executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec

因此，在为测试用例创建计划时，性能似乎会根据是否存在其他 cuFFT 计划而发生变化！使用分析器，我看到内核启动的结构在两种情况下没有变化；内核似乎都执行得更快。我对这种影响也没有合理的解释。

score 2 · Accepted Answer

如果您指定非标准步幅（批处理/转换无关紧要），cuFFT 在内部使用不同的路径。

广告编辑 2：这可能是 GPU Boost 调整 GPU 上的时钟。cuFFT 计划不会相互影响

获得更稳定结果的方法：

运行预热内核（任何可以使 GPU 完整工作的东西都很好）然后你的问题
增加批量大小
多次运行测试并取平均值
锁定 GPU 的时钟（在 GeForce 上不太可能 - 特斯拉可以做到）

score 1 · Accepted Answer

在@llukas 的建议下，我向 NVIDIA 提交了关于该问题的错误报告（如果您已注册为开发人员，则为https://partners.nvidia.com/bug/viewbug/1821802 ）。他们承认重叠计划的表现较差。他们实际上表示在这两种情况下使用的内核配置都不是最理想的，他们计划最终改进这一点。没有给出 ETA，但它可能不会出现在下一个版本中（上周刚刚发布了 8.0）。最后，他们说，从 CUDA 8.0 开始，没有解决方法可以让 cuFFT 使用跨步输入的更有效方法。

cuda - 为什么 cuFFT 性能会因输入重叠而受到影响？

2 回答 2

Related

Reference