Because you are conditioning the if statement on d_k, which is derived from the block index:
d_k = (blockIdx%x-1)
if(d_k == kmax-1)then
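this means that only one of the 128 blocks in your grid will actually execute the body of the if statement and set those particular shared memory values to zero. Most of your blocks will never execute what is inside the if statement.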
If kmax happens to be greater than 128, then none of your blocks will execute the if statement.
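As a quick sanity check of the two statements above, here is a small host-side sketch in plain Fortran (no GPU required) that counts how many block indices would satisfy the condition. The two kmax values (32 and 200) and the program name are assumptions chosen only for illustration; the grid size of 128 comes from the discussion above.

```fortran
! Host-side sketch: count how many of the 128 block indices would
! satisfy d_k == kmax-1. The kmax values are illustrative assumptions.
program count_matching_blocks
  implicit none
  integer, parameter :: nblocks = 128     ! grid size discussed above
  integer :: kmax_cases(2) = [32, 200]    ! assumed example values of kmax
  integer :: c, b, d_k, nmatch

  do c = 1, size(kmax_cases)
    nmatch = 0
    do b = 1, nblocks                     ! blockIdx%x is 1-based in CUDA Fortran
      d_k = b - 1
      if (d_k == kmax_cases(c) - 1) nmatch = nmatch + 1
    end do
    print '(a,i0,a,i0)', 'kmax = ', kmax_cases(c), ': matching blocks = ', nmatch
  end do
end program count_matching_blocks
```

With kmax = 32 exactly one block index matches, and with kmax = 200 none does, because d_k never exceeds 127.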
If you want that if statement to be executed in every threadblock, you need to condition it on something other than the block index.
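I would offer a suggestion on how to restructure your code, but it is not clear to me what you are trying to achieve by loading the data into shared memory. For example, your do-loop does not make much sense to me: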
do d_k = 0, kmax-2
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(d_j,s_d_l,d_k+1)
end do
(In the shared-memory subscripts above, a given thread has specific, fixed values for the indices s_d_j and s_d_l.)
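Your s_d_j and s_d_l variables are thread indices. So a given thread will see this do-loop and execute it iteratively, loading successive values from the various global memory arrays (d_bbb, d_ccc, etc.) into exactly the same location in each shared memory array. Each iteration overwrites what the previous one stored.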
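It seems to me that you do not really understand how thread execution works. Pretend you are a given thread, assigned specific values for s_d_j and s_d_l (and d_k, although you overwrite the block index when you reuse that variable as the loop index, which also seems odd to me), and then see whether the execution of your code makes sense.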
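To make the "pretend you are one thread" exercise concrete, here is a minimal host-side sketch in plain Fortran. It is not your kernel: the tiny array sizes, kmax = 4, the fixed indices j and l, and the names s_bbb and same_slot_overwrite are all assumptions for illustration. It mimics what a single thread does in the do-loop above.

```fortran
! Host-side sketch of one thread's view of the do-loop: the destination
! element never changes, so each iteration overwrites the previous one.
program same_slot_overwrite
  implicit none
  integer, parameter :: kmax = 4      ! assumed small value for illustration
  real :: d_bbb(0:1, 0:1, kmax)       ! stand-in for a global memory array
  real :: s_bbb(0:1, 0:1)             ! stand-in for the shared memory array
  integer :: j, l, d_k

  ! make slice k of the stand-in global array hold the value k
  do d_k = 1, kmax
    d_bbb(:, :, d_k) = real(d_k)
  end do

  j = 1                               ! this thread's fixed indices
  l = 0                               ! (arbitrary choice)
  s_bbb = -1.0

  do d_k = 0, kmax-2
    s_bbb(j, l) = d_bbb(j, l, d_k+1)  ! same destination on every iteration
  end do

  ! only the value stored by the final iteration (slice kmax-1) remains
  print *, 'value left in s_bbb(j,l):', s_bbb(j, l)
end program same_slot_overwrite
```

Only the value read from slice kmax-1 is left in the shared element when the loop ends, so the earlier iterations do no useful work.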
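EDIT: Based on additional comments:
You have stated that your overall data set size (x,y,z) is (64,64,32). You have said "I am slicing ... the arrays through z ... I want to put each slice in a block."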
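This suggests to me that you should launch one block per slice. Alternatively, perhaps you have an algorithm that assigns multiple blocks to a single slice. Either way, I will assume that you want all of the slice data (64,64) to be available to a given block assigned to that slice. For now I will assume that you launch 32 blocks. Extending this to the case where multiple blocks work on a single slice should not be difficult. I will also assume a 32x32 threadblock rather than the 16x16 you indicated. Extending this to 16x16, if you prefer, should not be difficult either.
You might do something like this: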
real, shared :: s_d_aaa_adk(0:63,0:63)
real, shared :: s_d_bbb_adk(0:63,0:63)
real, shared :: s_d_ccc_adk(0:63,0:63)
c above uses 48KB of shared mem, so assuming cc 2.0+ and cache config set accordingly
d_k = (blockIdx%x-1)
s_d_j = threadIdx%x-1
s_d_l = threadIdx%y-1
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = d_bbb(s_d_j,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l) = d_ccc(s_d_j,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l) = d_aaa(s_d_j,s_d_l,d_k+1)
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = d_bbb(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = d_ccc(s_d_j+blockDim%x,s_d_l,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = d_aaa(s_d_j+blockDim%x,s_d_l,d_k+1)
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = d_bbb(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = d_ccc(s_d_j,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = d_aaa(s_d_j,s_d_l+blockDim%y,d_k+1)
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_bbb(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_ccc(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = d_aaa(s_d_j+blockDim%x,s_d_l+blockDim%y,d_k+1)
c just guessing about what your intent was on filling with zeroes
c this just makes sure that one of the slices at the end gets zeroes
c instead of the values from the global arrays
if(d_k == kmax-1)then
c fill first quadrant
s_d_bbb_adk(s_d_j,s_d_l) = 0
s_d_ccc_adk(s_d_j,s_d_l) = 0
s_d_aaa_adk(s_d_j,s_d_l) = 0
c fill second quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l) = 0
c fill third quadrant
s_d_bbb_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j,s_d_l+blockDim%y) = 0
c fill fourth quadrant
s_d_bbb_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_ccc_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
s_d_aaa_adk(s_d_j+blockDim%x,s_d_l+blockDim%y) = 0
endif
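The four explicit quadrants above can also be written as strided loops, which adapt to a 16x16 block without further edits. What follows is only a sketch under stated assumptions, not a drop-in replacement: the module and kernel names, the dummy-argument declarations (0-based x and y, 1-based z, inferred from the indexing above), and the loop variables j and l are all hypothetical. The call to syncthreads() is also my addition; it would be needed before any thread reads shared values that another thread loaded.

```fortran
! Sketch only: names and array declarations are assumptions, see above.
module slice_load_sketch
  use cudafor
  implicit none
  integer, parameter :: nx = 64, ny = 64      ! slice size stated in the question
contains
  attributes(global) subroutine load_slice(d_aaa, d_bbb, d_ccc, kmax)
    integer, value :: kmax
    real, device :: d_aaa(0:nx-1, 0:ny-1, kmax)
    real, device :: d_bbb(0:nx-1, 0:ny-1, kmax)
    real, device :: d_ccc(0:nx-1, 0:ny-1, kmax)
    ! 3 arrays * 64*64 * 4 bytes = 48KB of shared memory, as noted above
    real, shared :: s_d_aaa_adk(0:nx-1, 0:ny-1)
    real, shared :: s_d_bbb_adk(0:nx-1, 0:ny-1)
    real, shared :: s_d_ccc_adk(0:nx-1, 0:ny-1)
    integer :: d_k, j, l

    d_k = blockIdx%x - 1                      ! one slice per block

    ! each thread strides across the slice by the block dimensions,
    ! so a 32x32 block covers the same four quadrants as the code above
    do l = threadIdx%y - 1, ny - 1, blockDim%y
      do j = threadIdx%x - 1, nx - 1, blockDim%x
        if (d_k == kmax - 1) then
          ! same guess as above: the last slice gets zeroes
          s_d_aaa_adk(j, l) = 0.0
          s_d_bbb_adk(j, l) = 0.0
          s_d_ccc_adk(j, l) = 0.0
        else
          s_d_aaa_adk(j, l) = d_aaa(j, l, d_k+1)
          s_d_bbb_adk(j, l) = d_bbb(j, l, d_k+1)
          s_d_ccc_adk(j, l) = d_ccc(j, l, d_k+1)
        end if
      end do
    end do

    call syncthreads()   ! make the loaded slice visible to the whole block

    ! ... computation that uses the shared arrays would go here ...
  end subroutine load_slice
end module slice_load_sketch
```

Under the assumptions in this answer it would be launched with 32 blocks of 32x32 threads. The if test depends only on the block index, so every thread in a block takes the same branch.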