I am using a cuFFT store callback in a complex-to-complex, out-of-place, 1D batch FFT (i.e., I am doing many 1D FFTs of the same size). From Section 2.9.4 of the documentation, I expect this callback to be called exactly once for each output. In particular, quoting verbatim from that link:
cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once.
Nevertheless, I seem to have an example that contradicts this. In the code below, I would expect each of the numbers 0-19 to appear exactly once, corresponding to the store callback being called exactly once for each output sample. However, when I do 504 1D FFTs of size 32, the store callback is called twice for each output!
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>

// Very simple store callback: prints the index and does the store
static __device__ void stor_cb(void *a, size_t index, cufftComplex z,
                               void *cb_info, void *sharedmem) {
    // Print the index. Each index should appear exactly once.
    if (index < 20) printf("%8llu\n", (unsigned long long)index);

    // Do the store
    ((cufftComplex *)a)[index] = z;
}

__device__ cufftCallbackStoreC stor_cb_ptr_d = stor_cb;

int main() {
    size_t work_size;

    // With these parameters, the store callback is
    // called twice for each output
    int fft_sz = 32;    // Size of each FFT
    int num_ffts = 504; // How many FFTs to do

    // With these parameters, the store callback is
    // called once for each output
    // int fft_sz = 1024; // Size of each FFT
    // int num_ffts = 20; // How many FFTs to do

    // Buffers
    cufftComplex *in_buf_h, *in_buf_d, *out_buf_d;

    // Allocate buffers on host and device
    in_buf_h = new cufftComplex[fft_sz*num_ffts];
    cudaMalloc(&in_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMalloc(&out_buf_d, fft_sz*num_ffts*sizeof(cufftComplex));

    // Fill input buffer with zeros and copy to device
    memset(in_buf_h, 0, fft_sz*num_ffts*sizeof(cufftComplex));
    cudaMemcpy(in_buf_d, in_buf_h, fft_sz*num_ffts*sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Plan num_ffts FFTs of size fft_sz
    cufftHandle plan;
    cufftCreate(&plan);
    cufftMakePlan1d(plan, fft_sz, CUFFT_C2C, num_ffts, &work_size);

    // Associate the store callback with the plan
    cufftCallbackStoreC stor_cb_ptr_h;
    cudaMemcpyFromSymbol(&stor_cb_ptr_h, stor_cb_ptr_d, sizeof(stor_cb_ptr_h));
    cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h, CUFFT_CB_ST_COMPLEX, 0);

    // Execute the plan. We don't actually care about the values. The idea
    // is that the store callback should be called exactly once for
    // each of the fft_sz*num_ffts samples.
    cufftExecC2C(plan, in_buf_d, out_buf_d, CUFFT_FORWARD);

    // Sync the device to flush the printf output
    cudaDeviceSynchronize();

    return 0;
}
Example output for fft_sz=32, num_ffts=504:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
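To rule out this being an artifact of printf interleaving, the invocations can be counted directly instead of printed. Here is a minimal sketch of a counting variant (my addition, not part of the reproducer; the names stor_cb_cnt and counts_d are hypothetical): pass a zeroed device array of counters as the callerInfo argument of cufftXtSetCallback and tally each call with atomicAdd.
// Variant store callback: counts invocations per output index instead of
// printing. Register it the same way as stor_cb above (cudaMemcpyFromSymbol
// of the __device__ pointer), but also pass the counter array as callerInfo:
//   unsigned int *counts_d;
//   cudaMalloc(&counts_d, fft_sz*num_ffts*sizeof(unsigned int));
//   cudaMemset(counts_d, 0, fft_sz*num_ffts*sizeof(unsigned int));
//   cufftXtSetCallback(plan, (void **)&stor_cb_cnt_ptr_h,
//                      CUFFT_CB_ST_COMPLEX, (void **)&counts_d);
static __device__ void stor_cb_cnt(void *a, size_t index, cufftComplex z,
                                   void *cb_info, void *sharedmem) {
    unsigned int *counts = (unsigned int *)cb_info;
    atomicAdd(&counts[index], 1u);   // one tally per callback invocation
    ((cufftComplex *)a)[index] = z;  // still perform the store
}
__device__ cufftCallbackStoreC stor_cb_cnt_ptr_d = stor_cb_cnt;
After cudaDeviceSynchronize(), copying counts_d back to the host should show every entry equal to 1 if the documentation's guarantee holds; the printed output above suggests it would instead show 2 everywhere in the fft_sz=32, num_ffts=504 case.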
In contrast, if I do 20 FFTs of size 1024, I get the expected behavior: the store callback is called exactly once for each output. Example output for fft_sz=1024, num_ffts=20:
$ stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
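One caveat: the reproducer ignores all return codes for brevity. To rule out a silently failing plan or callback registration, this is the kind of checking I would add (a sketch only; the CHECK_CUFFT macro is my own hypothetical helper, not a cuFFT facility):
// Hypothetical helper: abort on any cuFFT error. cufftCreate, cufftMakePlan1d,
// cufftXtSetCallback, and cufftExecC2C all return a cufftResult.
#define CHECK_CUFFT(call)                                        \
    do {                                                         \
        cufftResult err_ = (call);                               \
        if (err_ != CUFFT_SUCCESS) {                             \
            fprintf(stderr, "cuFFT error %d at %s:%d\n",         \
                    (int)err_, __FILE__, __LINE__);              \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Usage, replacing the unchecked calls in main:
// CHECK_CUFFT(cufftXtSetCallback(plan, (void **)&stor_cb_ptr_h,
//                                CUFFT_CB_ST_COMPLEX, 0));
// CHECK_CUFFT(cufftExecC2C(plan, in_buf_d, out_buf_d, CUFFT_FORWARD));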
Am I misunderstanding something, do I have a bug, or is this a problem in cuFFT?
I am running this on Linux Mint, with CUDA V8.0.61 and g++ 5.4.0, on a GeForce GTX 1080:
$ uname -a
Linux orpheus 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8114 MBytes (8507752448 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1848 MHz (1.85 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1080
Result = PASS
Here are my compile commands:
$ nvcc -ccbin g++ -dc -m64 -o stor_cb_tst.o -c stor_cb_tst.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ nvcc -ccbin g++ -m64 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
$ ./stor_cb_tst
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
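Incidentally, the nvcc warnings above show that I am building for the default (deprecated) compute_20 target, since I pass no architecture flag. In case the device code path cuFFT selects depends on the compiled architecture, I assume a build targeting the GTX 1080 specifically (compute capability 6.1, per the deviceQuery output above) would look like this; I have not shown its output here:
$ nvcc -ccbin g++ -dc -m64 -arch=sm_61 -o stor_cb_tst.o -c stor_cb_tst.cu
$ nvcc -ccbin g++ -m64 -arch=sm_61 -o stor_cb_tst stor_cb_tst.o -lcufft_static -lculibos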