我正在尝试比较使用 FFT 与使用窗口方法的互相关。
我的 Matlab 代码是:
isize = 20;
n = 7;
for i = 1:n %%7x7 xcorr
for j = 1:n
xcout(i,j) = sum(sum(ffcorr1 .* ref(i:i+isize-1,j:j+isize-1))); %%ref is 676 element array and ffcorr1 is a 400 element array
end
end
类似的CUDA内核:
__global__ void xc_corr(double* in_im, double* ref_im, int pix3, int isize, int n, double* out1, double* temp1, double* sum_temp1)
{
int p = blockIdx.x * blockDim.x + threadIdx.x;
int q = 0;
int i = 0;
int j = 0;
int summ = 0;
for(i = 0; i < n; ++i)
{
for(j = 0; j < n; ++j)
{
summ = 0; //force update
for(p = 0; p < pix1; ++p)
{
for(q = 0; q < pix1; ++q)
{
temp1[((i*n+j)*pix1*pix1)+p*pix1+q] = in_im[p*pix1+q] * ref_im[(p+i)*pix1+(q+j)];
sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q] += temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
out1[i*n+j] = sum_temp1[((i*n+j)*pix1*pix1)+p*pix1+q];
}
}
}
}
我在内核中将其称为
int blocksize = 64; //multiple of 32
int nblocks = (pix3+blocksize-1)/blocksize; //round to max pix3 = 400
xc_corr <<< nblocks,blocksize >>> (ffcorr1, ref_d, pix3, isize, npix, xcout, xc_partial);
cudaThreadSynchronize();
不知何故,当我对输出文件进行比较时,我看到 CUDA 内核只计算前 400 个元素。
编写这个内核的正确方法是什么?
另外,在我的内核中声明 i,j 有什么区别,如下所示?
int i = blockIdx.x * blockDim.y + threadIdx.x * threadIdx.y;
int j = blockIdx.y * blockDim.x + threadIdx.x * threadIdx.y;