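Suppose I have a 1D array that was converted from an MxN 2D matrix, and I want to parallelize over the columns and perform some operation on each one. How do I assign one thread to each column?

For example, if I have a 3x3 matrix: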

1  2  3

4  5  6

7  8  9

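I want to add a number to each entry depending on its column # (so the first column gets 1 added, the second column gets 2 added, ...), so that it becomes: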

1+1   2+1   3+1

4+2   5+2   6+2

7+3   8+3   9+3

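How can I do this in CUDA? I know how to assign a thread to every element of the array, but I don't know how to assign a thread to each column. So what I want is to send each column (1, 2, 3) (4, 5, 6) (7, 8, 9) and perform the operation on it.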

2 Answers

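In your example you are adding the numbers based on the row, not the column. Either way, you know the row and column lengths of the matrix (you know it is MxN), so you can recover both indices from the flat index. What you can do is something like this: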

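__global__ void MyAddingKernel(int* matrix, int M, int N)
{

    //Row-major M x N matrix: element (row, col) is stored at row*N + col
    int gid = threadIdx.x + blockDim.x*blockIdx.x;
    //Skip surplus threads when the grid is larger than the matrix
    if (gid >= M*N) return;
    //Let's add the row number to each element
    matrix[ gid ] += gid / N;
    //Let's add the column number to each element
    matrix[ gid ] += gid % N;

}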
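
As a rough illustration of how a kernel of this shape might be launched, here is a minimal self-contained sketch. It assumes a row-major 3x3 matrix and adds the 1-based column number to every element, which is what the question describes in words. The kernel name `addColumnNumber` and the block size of 128 are illustrative choices, not something taken from the question or the answer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element, each element gets its
// 1-based column number added. Assumes row-major storage (index = row*N + col).
__global__ void addColumnNumber(int* matrix, int M, int N)
{
    int gid = threadIdx.x + blockDim.x * blockIdx.x;
    if (gid >= M * N) return;     // surplus threads in the last block do nothing
    matrix[gid] += gid % N + 1;   // gid % N is the 0-based column index
}

int main()
{
    const int M = 3, N = 3;
    std::vector<int> host = {1, 2, 3, 4, 5, 6, 7, 8, 9};

    int* dev = nullptr;
    const size_t bytes = host.size() * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 128;  // assumed block size
    const int blocks = (M * N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    addColumnNumber<<<blocks, threadsPerBlock>>>(dev, M, N);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    for (int r = 0; r < M; ++r) {
        for (int c = 0; c < N; ++c) printf("%d ", host[r * N + c]);
        printf("\n");
    }
    return 0;
}
```

On a CUDA-capable device the rows should come back as 2 4 6, 5 7 9 and 8 10 12, since column 1 gets 1 added, column 2 gets 2 and column 3 gets 3.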

If you want to add a different number per column, you can do something like this:

matrix[ gid ] += my_col_number_function(gid%N);
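
The helper `my_col_number_function` is not defined in this answer. As a sketch, it could be any `__device__` function of the column index. The example below pairs it with the layout the question literally asks for: one thread per column, with each thread walking down the rows of its column. The function body (ten times the 1-based column number), the kernel name `addPerColumn` and the block size are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-column value; any function of the column index would do.
__device__ int my_col_number_function(int col)
{
    return (col + 1) * 10;
}

// One thread per column: thread `col` visits all M rows of its column.
// Assumes row-major storage (index = row*N + col).
__global__ void addPerColumn(int* matrix, int M, int N)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (col >= N) return;  // more threads than columns
    const int value = my_col_number_function(col);
    for (int row = 0; row < M; ++row)
        matrix[row * N + col] += value;
}

int main()
{
    const int M = 4, N = 5;
    std::vector<int> host(M * N, 0);

    int* dev = nullptr;
    const size_t bytes = host.size() * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 64;  // assumed block size
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    addPerColumn<<<blocks, threadsPerBlock>>>(dev, M, N);
    if (cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    // Compare against the same formula evaluated on the host.
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c)
            if (host[r * N + c] != (c + 1) * 10) {
                printf("mismatch at (%d, %d)\n", r, c);
                return 1;
            }
    printf("all %d elements match\n", M * N);
    return 0;
}
```

With row-major storage, neighbouring threads touch neighbouring addresses in each loop iteration, so this per-column layout keeps global memory accesses coalesced.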
answered 2012-04-26T19:54:16.047

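Use a better grid layout to avoid those modulo operations.

Use the unique block index to select the row, computed as a 64-bit value so that it stays in range on recent CUDA versions.

Let the threads of a block loop over all elements of that row and add the row's value!

If the computed data is unique within a block (row), tiling the input data is a general approach, especially for more complex calculations.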

/*
 * Adds (row index + 1) to every cell of a row-major matrix, one block per row.
 *
 * @param inOutMat_g      matrix in global memory, modified in place
 * @param inRowCount_s    number of rows
 * @param inColumnCount_s number of columns
 * @param inTileCount_s   number of blockDim.x-wide tiles needed to cover one row
 */
__global__ void addRowNumberToCells(int* inOutMat_g, 
    const unsigned long long int inRowCount_s, 
    const unsigned long long int inColumnCount_s, 
    const int inTileCount_s)
{

    //get unique block index, used as the row index
    const unsigned long long int blockId = blockIdx.x //1D
        + (unsigned long long int)blockIdx.y * gridDim.x //2D
        + (unsigned long long int)gridDim.x * gridDim.y * blockIdx.z; //3D

    /* 
     * check row range in case kernel is called 
     * with more blocks than rows 
     * (since the check is block-wide, the following __syncthreads() is safe)
     */
    if(blockId >= inRowCount_s)
        return;

    /*
     * calculate the block-wide value blockId + 1 once
     * shared memory may be overhead here, 
     * but it shows the concept if the calculation is more complex
     */
    __shared__ unsigned long long int blockIdAnd1_s;
    if(threadIdx.x == 0)
        blockIdAnd1_s = blockId + 1;
    __syncthreads();

    //offset of the first cell of this block's row
    const unsigned long long int rowStart = blockId * inColumnCount_s;
    unsigned long long int idx;

    //loop over tiles
    for(int i = 0; i < inTileCount_s; ++i)
    {
        //calculate new offset so that consecutive threads write consecutive cells
        idx = (unsigned long long int)i * blockDim.x + threadIdx.x;
        //check new index range in case column count is no multiple of blockDim.x
        if(idx >= inColumnCount_s)
            break;
        inOutMat_g[rowStart + idx] += (int)blockIdAnd1_s;
    }

}

Example for CUDA compute capability 2.0:

mat[131000][1000]

Required block count for gridDim.y = 131000 / 65535 = 2, rounded up!

inTileCount_s = 1000 / 192 = 6, rounded up!

(192 threads per block = 100% occupancy on CUDA 2.0)

addRowNumberToCells<<<dim3(65535, 2, 1), dim3(192, 1, 1)>>>(mat, 131000, 1000, 6);
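
The two "rounded up" steps in this example are integer ceiling divisions. The sketch below shows one way they might be computed on the host rather than by hand. The helper `ceilDiv` and the variable names are hypothetical, and the per-dimension grid limit of 65535 is simply the value used in the example above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Integer ceiling division for positive operands (hypothetical helper).
static unsigned int ceilDiv(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

int main()
{
    const unsigned int rowCount        = 131000;
    const unsigned int columnCount     = 1000;
    const unsigned int threadsPerBlock = 192;
    const unsigned int maxGridDim      = 65535;  // limit assumed from the example

    // One block per row, folded into a 2D grid when rows exceed the limit.
    const unsigned int gridX = rowCount < maxGridDim ? rowCount : maxGridDim;
    const unsigned int gridY = ceilDiv(rowCount, gridX);

    const dim3 grid(gridX, gridY, 1);
    const dim3 block(threadsPerBlock, 1, 1);

    // Number of block-wide tiles needed to cover one row.
    const unsigned int tileCount = ceilDiv(columnCount, threadsPerBlock);

    printf("grid = (%u, %u, %u), block = (%u, %u, %u), tiles = %u\n",
           grid.x, grid.y, grid.z, block.x, block.y, block.z, tileCount);
    return 0;
}
```

For the sizes in the example this gives a grid of (65535, 2, 1) and 6 tiles. The grid then holds 131070 blocks for 131000 rows, which is why the kernel needs a range check on the block index before doing any work.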

answered 2012-04-27T00:40:35.837