
This is my code for multiplying a sparse matrix in compressed sparse row (CSR) format.
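
For reference, the kernel expects the standard CSR arrays val, index and rowptr. Here is a small made-up example of that layout (illustrative only, not my actual data):

// 4x4 example matrix:
//   | 1 0 0 2 |
//   | 0 3 0 0 |
//   | 0 0 4 0 |
//   | 5 0 0 6 |
int val[]    = {1, 2, 3, 4, 5, 6};  // nonzero values, stored row by row
int index[]  = {0, 3, 1, 2, 0, 3};  // column index of each nonzero
int rowptr[] = {0, 2, 3, 4, 6};     // row r spans entries rowptr[r] .. rowptr[r+1]-1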

__kernel void mykernel(__global int* colvector,
                       __global int* val,
                       __global int* result,
                       __global int* index,
                       __global int* rowptr,
                       __global int* sync )
{   
    __local int vals[1000];
    for(int i=0;i<4;i++)
    {
        result[i]=0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    barrier(CLK_GLOBAL_MEM_FENCE);

    const int items_per_row=32;//total threads working in a row

    const int thread_id=get_global_id(0)+get_local_id(0);//total threads in the program

    const int warpid = thread_id/items_per_row;//warp id is actual row

    int lane=thread_id&(items_per_row-1);//thread id within the warp

    int row = warpid;

    if(row<4)
    {
        int sum = 0;

        int row_start = rowptr[row];
        int row_end = rowptr[row+1];

        vals[get_global_id(0)]=0;
        barrier(CLK_LOCAL_MEM_FENCE);
        barrier(CLK_GLOBAL_MEM_FENCE);


        for (int i = row_start+lane; i<row_end; i+=items_per_row)
        {
            vals[get_local_id(0)]+=val[i]*colvector[index[i]];
        }

        barrier(CLK_LOCAL_MEM_FENCE);
        barrier(CLK_GLOBAL_MEM_FENCE);

        if (lane < 16 ) vals[get_local_id(0)] += vals[get_local_id(0) + 16];

        if (lane < 8 ) vals[get_local_id(0)] += vals[get_local_id(0) + 8];

        if (lane < 4 ) vals[get_local_id(0)] += vals[get_local_id(0) +4];

        if (lane < 2 ) vals[get_local_id(0)] += vals[get_local_id(0) + 2];

        if (lane < 1 ) vals[get_local_id(0)] += vals[get_local_id(0) + 1];

        barrier(CLK_LOCAL_MEM_FENCE);
        barrier(CLK_GLOBAL_MEM_FENCE);

        if(lane==0)
        {
            result[row] += vals[get_local_id(0)];
        }
    }
}

The OpenCL code above was converted from the CUDA code given below:

__global__ void
spmv_csr_vector_kernel(const int num_rows,
                       const int * ptr,
                       const int * indices,
                       const float * data,
                       const float * x,
                       float * y )
{
    extern __shared__ float vals[]; // dynamically sized; length supplied at kernel launch

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index

    int warp_id = thread_id / 32; // global warp index

    int lane = thread_id & (32 - 1); // thread index within the warp

    // one warp per row

    int row = warp_id;

    if (row < num_rows)
    {
        int row_start = ptr[row];

        int row_end = ptr[row+1];

        // compute running sum per thread

        vals[threadIdx.x] = 0;

        for(int jj = row_start + lane; jj < row_end; jj += 32)
        {
            vals[threadIdx.x] += data[jj] * x[indices[jj]];
        }
        // parallel reduction in shared memory

        if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];

        if (lane < 8) vals[threadIdx.x] += vals[threadIdx.x + 8];

        if (lane < 4) vals[threadIdx.x] += vals[threadIdx.x + 4];

        if (lane < 2) vals[threadIdx.x] += vals[threadIdx.x + 2];

        if (lane < 1) vals[threadIdx.x] += vals[threadIdx.x + 1];

        // first thread writes the result

        if (lane == 0)
        {
            y[row] += vals[threadIdx.x];
        }
    }
}

The CUDA code is correct, but my OpenCL kernel is not returning the correct output. I have been trying to fix it for a week without success. Does anyone know what mistake I am making?


2 Answers


I can see at least one bug: thread_id is not computed the same way in the two versions. The CUDA expression blockDim.x * blockIdx.x + threadIdx.x corresponds to get_global_id(0) in OpenCL, not to get_global_id(0) + get_local_id(0). Likewise, get_local_id(0) corresponds to threadIdx.x.
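
Concretely, the start of your kernel would look something like this (a sketch of just the indexing fix; everything else stays as you have it):

const int items_per_row = 32;
const int thread_id = get_global_id(0);            // == blockDim.x * blockIdx.x + threadIdx.x
const int warpid = thread_id / items_per_row;      // one row per group of 32 work-items
const int lane = thread_id & (items_per_row - 1);  // work-item index within the group
// ...and wherever the CUDA version indexes vals with threadIdx.x, use get_local_id(0):
vals[get_local_id(0)] = 0;

Note that get_global_id(0) is already defined as get_group_id(0) * get_local_size(0) + get_local_id(0) (plus the global work offset, which is zero by default), so it is the direct analogue of the CUDA expression.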

answered 2012-10-21T08:51:14.400

Try using swan; it may help you understand your problem.

You can find an article about it here.

answered 2013-08-24T15:47:30.727