cuda - 多次调用 cufft 商店回调

Question

我在复杂到复杂的异地一维批量 FFT 中使用袖带存储回调（即我正在执行许多相同大小的一维 FFT）。从文档的第 2.9.4 节开始，我希望每个输出都只调用一次此回调。特别是从链接中逐字引用的这句话：

cuFFT 将为输入中的每个点调用一次且仅一次的加载回调例程。类似地，它将为输出中的每个点调用一次存储回调例程，并且仅一次。

尽管如此，我似乎有一个与此相矛盾的例子。在下面的代码中，我希望看到每个数字 0-19 恰好出现一次，对应于每个输出样本只调用一次存储回调。但是，当我执行 504 个大小为 32 的 1D FFT 时，每个输出都会调用存储回调两次！

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include <cuda.h>    
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>


// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {

    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}
__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;


int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;            // Size of each FFT
    int num_ffts = 504;         // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
//    int fft_sz = 1024;         // Size of each FFT
//    int num_ffts = 20;         // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // Plan num_ffts of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate save callback with plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, -1);

    // Sync the device to flush the output
    cudaDeviceSynchronize();

    return 0;
}

fft_sz=32, num_ffts=504 的示例输出：

$ stor_cb_tst 
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19

相反，如果我执行 20 个大小为 1024 的 FFT，那么我会得到预期的行为：每个输出都只调用一次存储回调。fft_sz=1024, num_ffts=20 的示例输出：

$ stor_cb_tst 
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19

我是不是误解了什么，我有错误，还是袖带有问题？

我在 Linux Mint 上运行它，在 GeForce GTX 1080 上使用 cuda V8.0.61、g++ 5.4.0：

$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8114 MBytes (8507752448 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1848 MHz (1.85 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS

这是我的编译命令：

$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu 
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst 
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19

score 2 · Accepted Answer

我能够在 CUDA 8 上重现观察结果，但不能在 CUDA 9 上重现。但是我认为即使使用 CUDA 8 也没有任何问题。让我们从更仔细地查看文档开始：

来自CUFFT 文档第 2.9.4 节：

同样，它将为输出中的每个点调用一次存储回调例程，并且仅一次。

您假设输出中的每个点都有一个相应的唯一值index传递给存储回调例程，但是我们很快就会看到情况并非如此。

它只会从最后一个阶段内核调用存储回调例程。

所以我们看到，在转换的最后阶段，可以从多个独立的内核（注意使用kernel(s) ）调用存储回调例程。

对于某些配置，线程可以按任何顺序加载或存储输入或输出，并且 cuFFT 不保证给定线程处理的输入或输出是连续的。这些特征可能随变换大小、变换类型（例如 C2C 与 C2R）、维度数量和 GPU 架构而变化。这些变体也可能从一个库版本更改为下一个库版本。

这提供了一些额外的线索，我们不应该期望在每种情况下都对所有输出数据进行良好的连续处理。并且指示的可变性可能取决于确切的变换参数以及 CUFFT 库版本。

因此，让我们开始讨论黄铜钉。CUFFT 是否在每个输出点多次调用存储回调？它不是。为了证明这一点，让我们修改您的商店回调如下：

static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {

    // Print the index. Each index should appear exactly once.
    //if (index < 20) printf("%8llu, %p, \n", index, a);
    cufftComplex temp = ((cufftComplex *)a)[index];
    temp.x++;
    ((cufftComplex *)a)[index] = temp;
    // Do the store
    //((cufftComplex *)a)[index] = z;
    if (index < 20) printf("%8llu, %p, %f\n", index, a, temp.x);

}

这个存储回调不会写入预期的输出，而是只会将给定的输出点增加 1。此外，我们不会仅仅打印出index值并可能做出不正确的假设，而是打印出index，加上基地址a，加上我们增加的实际值。为了使这一切正常工作，我们需要将整个输出数组预初始化为零：

cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
cudaMemset(out_buf_d, 0, fft_sz*num_ffts*sizeof(cufftComplex));  // add this

当我在 CUDA 8、linux 和 cc3.5 设备（Tesla K20x）上编译并运行修改后的代码时，输出如下：

$ nvcc -arch=sm_35 -o t20 t20.cu -rdc=true -lcufft_static -lcudadevrt -lculibos
$ ./t20
       0, 0x2305b5f800, 1.000000
       1, 0x2305b5f800, 1.000000
       2, 0x2305b5f800, 1.000000
       3, 0x2305b5f800, 1.000000
       4, 0x2305b5f800, 1.000000
       5, 0x2305b5f800, 1.000000
       6, 0x2305b5f800, 1.000000
       7, 0x2305b5f800, 1.000000
       8, 0x2305b5f800, 1.000000
       9, 0x2305b5f800, 1.000000
      10, 0x2305b5f800, 1.000000
      11, 0x2305b5f800, 1.000000
      12, 0x2305b5f800, 1.000000
      13, 0x2305b5f800, 1.000000
      14, 0x2305b5f800, 1.000000
      15, 0x2305b5f800, 1.000000
      16, 0x2305b5f800, 1.000000
      17, 0x2305b5f800, 1.000000
      18, 0x2305b5f800, 1.000000
      19, 0x2305b5f800, 1.000000
       0, 0x2305b7d800, 1.000000
       1, 0x2305b7d800, 1.000000
       2, 0x2305b7d800, 1.000000
       3, 0x2305b7d800, 1.000000
       4, 0x2305b7d800, 1.000000
       5, 0x2305b7d800, 1.000000
       6, 0x2305b7d800, 1.000000
       7, 0x2305b7d800, 1.000000
       8, 0x2305b7d800, 1.000000
       9, 0x2305b7d800, 1.000000
      10, 0x2305b7d800, 1.000000
      11, 0x2305b7d800, 1.000000
      12, 0x2305b7d800, 1.000000
      13, 0x2305b7d800, 1.000000
      14, 0x2305b7d800, 1.000000
      15, 0x2305b7d800, 1.000000
      16, 0x2305b7d800, 1.000000
      17, 0x2305b7d800, 1.000000
      18, 0x2305b7d800, 1.000000
      19, 0x2305b7d800, 1.000000
$

我们看到的是：

是的，这些index值是重复的，但是每个重复情况的基地址（指针）是不同的。因此，即使该index值重复，输出点也只写入一次。
作为进一步的确认，如果我们多次写入输出点，对于我们的特定回调，我们希望看到输出增加到 2.000000。但是我们在输出中只看到 1.000000。因此，没有一个输出点被多次写入。

我认为这种特定的输出模式很可能来自转换最后阶段的 2 个单独的内核调用。可以从探查器中获得一些进一步的证据。

正如我在开头提到的，在这个测试用例中使用 CUDA 9 而不是 CUDA 8 时，我看到了不同的行为（只打印了一组从 0 到 19 的输出索引。）但是这种可能性（从库版本到行为的变化）如前所述，文档中也说明了库版本）。

期待后续问题：

但是，如果该index值不是唯一的，并且我想对基于变化的输出应用一些转换，index我该怎么办？

我认为这里的假设是，您打算应用于批处理转换的输出的任何转换都应该只取决于批处理中的索引位置。在这个假设下，我的期望是：

索引的多内核复制将始终在批处理边界上完成。
index可以通过对传递给回调例程的值执行模批量大小操作来应用适当的转换。

我在没有证据的情况下推进了这一点，也没有尝试通过文档来确认这一点，但鉴于已经涵盖的观察结果，这是唯一对我有意义的实现。一个要点是，如果您希望应用不同批次的转换，这可能不是实现它的方法（即通过回调）。但是，正如我已经提到的，CUDA 9 中的情况似乎发生了变化。如果您对此有任何担忧，请随时提交带有所需/预期行为（和/或文档更新请求）的 RFE（错误报告）在http://developer.nvidia.com，请记住您的预期行为可能已经在 CUDA 9 中实现。

cuda - 多次调用 cufft 商店回调

1 回答 1

Related

Reference