In CUDA, to span multiple blocks and thereby extend the range of array indices a kernel can cover, we do the following:
Host-side code:
dim3 dimGrid(9, 1);   // 9 blocks will be launched in total
dim3 dimBlock(16, 1); // each block has 16 threads, so the grid
                      // has 16 x 9 = 144 threads in total
Device-side code:
...
...
idx = blockIdx.x * blockDim.x + threadIdx.x; // idx ranges from 0 to 143
a[idx] = a[idx] * a[idx];
...
...
What is the OpenCL equivalent of the above?