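Use a better grid layout to avoid those modulo operations.
Use a unique block index per row, held in a 64-bit integer so it covers the row range available on recent CUDA versions.
Let each thread loop over the elements of its row, offset by its thread index!
Tiling the input data is a general approach when the computed value is the same for a whole block (row), especially for more complex calculations.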
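The tile indexing can be checked on the host first. The following is only a sketch in plain C++ under stated assumptions: `fillRowTiled` and `blockDimX` are hypothetical names, the inner loop stands in for the threads of one block, and nothing here runs on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of one block filling one row in tiles.
// blockDimX plays the role of blockDim.x, t plays the role of threadIdx.x.
std::vector<int> fillRowTiled(std::size_t columnCount, unsigned blockDimX, int rowValue)
{
    assert(blockDimX > 0);
    std::vector<int> row(columnCount, 0);
    // ceil(columnCount / blockDimX) tiles are needed to cover the row
    const std::size_t tileCount = (columnCount + blockDimX - 1) / blockDimX;
    for (std::size_t tile = 0; tile < tileCount; ++tile) {
        for (unsigned t = 0; t < blockDimX; ++t) {
            const std::size_t idx = tile * blockDimX + t;
            if (idx >= columnCount)
                break; // the last tile may be only partly used
            row[idx] = rowValue;
        }
    }
    return row;
}
```

Neighbouring values of `t` hit neighbouring cells inside a tile, which is the access pattern that lets the real threads write sequentially.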
/*
 * Writes the 1-based row number into every cell of a row-major matrix.
 * One block handles one row.
 *
 * @param inOutMat_g      matrix in global memory (row-major)
 * @param inRowCount_s    number of rows
 * @param inColumnCount_s number of columns
 * @param inTileCount_s   tiles per row = ceil(inColumnCount_s / blockDim.x)
 */
__global__ void addRowNumberToCells(int* inOutMat_g,
                                    const unsigned long long int inRowCount_s,
                                    const unsigned long long int inColumnCount_s,
                                    const int inTileCount_s)
{
    //get unique block index (one block per row)
    const unsigned long long int blockId = blockIdx.x //1D
        + (unsigned long long int)blockIdx.y * gridDim.x //2D
        + (unsigned long long int)gridDim.x * gridDim.y * blockIdx.z; //3D

    /*
     * check the row range in case the kernel is launched
     * with more blocks than rows
     * (the check is block-wide, so the following __syncthreads() is safe)
     */
    if(blockId >= inRowCount_s)
        return;

    /*
     * calculate blockId + 1 once per block
     * shared memory may be overhead here,
     * but it shows the concept for more complex calculations
     */
    __shared__ unsigned long long int blockIdAnd1_s;
    if(threadIdx.x == 0)
        blockIdAnd1_s = blockId + 1;
    __syncthreads();

    //offset of this block's row in the row-major matrix
    const unsigned long long int rowOffset = blockId * inColumnCount_s;
    unsigned long long int idx;
    //loop over tiles
    for(int i = 0; i < inTileCount_s; ++i)
    {
        //calculate new offset for sequential thread writes
        idx = (unsigned long long int)i * blockDim.x + threadIdx.x;
        //check new index range in case column count is not a multiple of blockDim.x
        if(idx >= inColumnCount_s)
            break;
        inOutMat_g[rowOffset + idx] = (int)blockIdAnd1_s;
    }
}
Example for CUDA compute capability 2.0:
mat[131000][1000]
required gridDim.y = 131000 / 65535 = 2 (rounded up)
inTileCount_s = 1000 / 192 = 6 (rounded up)
(192 threads per block = 100% occupancy on compute capability 2.0)
addRowNumberToCells<<<dim3(65535, 2, 1), dim3(192, 1, 1)>>>(mat, 131000, 1000, 6);
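The rounded-up divisions in the example can be computed instead of worked out by hand. This is a minimal host-side sketch in plain C++: `ceilDiv`, `LaunchConfig` and `makeLaunchConfig` are hypothetical helpers, not part of the CUDA API, and the per-dimension grid limit of 65535 is passed in as an assumption taken from the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer division rounded up.
constexpr std::uint64_t ceilDiv(std::uint64_t a, std::uint64_t b)
{
    return (a + b - 1) / b;
}

// Hypothetical container for the launch parameters used in the example.
struct LaunchConfig {
    std::uint64_t gridX;     // blocks in x
    std::uint64_t gridY;     // blocks in y
    std::uint64_t tileCount; // tiles per row
};

// One block per row: fill x up to the limit, then spill into y.
LaunchConfig makeLaunchConfig(std::uint64_t rowCount,
                              std::uint64_t columnCount,
                              std::uint64_t threadsPerBlock,
                              std::uint64_t maxGridDim)
{
    assert(rowCount > 0 && threadsPerBlock > 0 && maxGridDim > 0);
    LaunchConfig cfg{};
    cfg.gridX = rowCount < maxGridDim ? rowCount : maxGridDim;
    cfg.gridY = ceilDiv(rowCount, cfg.gridX);
    cfg.tileCount = ceilDiv(columnCount, threadsPerBlock);
    return cfg;
}
```

For 131000 rows, 1000 columns, 192 threads per block and a limit of 65535, this gives a 65535 x 2 grid and 6 tiles. The grid then holds 131070 blocks, slightly more than there are rows, so the kernel still needs a range check on the block index.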