algorithm - TERCOM algorithm - Changing from a single thread to multiple threads in CUDA

Question

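I am currently porting the TERCOM algorithm from using a single thread to using multiple threads. Briefly, the TERCOM algorithm takes 5 measurements and a heading, and compares these measurements against a prestored map. The algorithm selects the best match, i.e. the lowest mean absolute difference (MAD), and returns the position.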

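The code works perfectly with one thread and for loops, but when I try to use multiple threads and blocks it returns the wrong answer. It seems the multi-threaded version does not "run through" the calculation the way the single-threaded version does. Does anyone know what I am doing wrong?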

Here is the code using for loops:

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    //Without threads
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    //Calculate Mean Absolute Difference
    for(float row=0;row<m;row++)
    {
        for(float col=0;col<n;col++)
        {
            for(float g=0; g<N; g++)
            {
                f[(int)g] = tex2D (tex, col+(g-2)*offset_x+0.5f, row+(g-2)*offset_y+0.5f);
                MAD += abs(measurements[(int)g]-f[(int)g]);
            }
            if(MAD<min) 
            {
                min=MAD;
                pos[0]=col;
                pos[1]=row;
            }
            MAD=0;                  //Reset MAD
        }
    }

    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

Here is my attempt at using multiple threads:

__global__ void kernel (int m, int n, int h, int N, float *f, float heading, float *measurements) 
{
    // With threads
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float pos[2]={0};
    float theta=heading*(PI/180);
    float MAD=0;

    // Calculate how much to move in x and y direction
    float offset_x = h*cos(theta);
    float offset_y = -h*sin(theta); 

    float min=100000; //Some High value

    if(idx < n && idy < m)
    {
        for(float g=0; g<N; g++)
        {
            f[(int)g] = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
            MAD += abs(measurements[(int)g]-f[(int)g]); 
        }

        if(MAD<min) 
        {
            min=MAD;
            pos[0]=idx;
            pos[1]=idy;
        }
        MAD=0;                  //Reset MAD
    }
    f[0]=min;
    f[1]=pos[0];
    f[2]=pos[1];
}

Launching the kernel:

dim3 dimBlock( 16,16 );
dim3 dimGrid;
dimGrid.x = (n + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (m + dimBlock.y - 1)/dimBlock.y;

kernel <<< dimGrid,dimBlock >>> (m, n, h, N, dev_results, heading, dev_measurements);

score 1 · Accepted Answer

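The basic problem here is that the code contains a memory race, centered on the use of f as both a kind of thread-local scratch space and an output variable. Every concurrent thread tries to write values to the same locations in f at the same time, which produces undefined behaviour.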

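As far as I can tell, f does not need to be used as scratch space at all, and the main computational section of the kernel could be written as something like: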

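if(idx < n && idy < m)
{
    for(int g=0; g<N; g++)
    {
        // Sample the map along the heading; keep the value in a register instead of f
        float fval = tex2D (tex, idx+(g-2)*offset_x+0.5f, idy+(g-2)*offset_y+0.5f);
        MAD += fabsf(measurements[g]-fval); 
    }
    // Thread-local result for map cell (idx, idy)
    min=MAD;
    pos[0]=idx;
    pos[1]=idy;
}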

[disclaimer: written in browser, use at own risk]

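At the end of that calculation, each thread has its own values of min and pos. At a minimum, these must be stored in unique global memory (i.e. the output must have enough space for every thread's result). You will then need to perform some sort of reduction operation to obtain the global minimum from the set of thread-local values. This could be done on the host, in device code, or in some combination of the two. There is a lot of code already available for CUDA parallel reductions, which you should be able to find by searching and/or by looking at the examples supplied with the CUDA toolkit. Adapting them to your specific case, where you need to retain the position along with the minimum value, should be trivial.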
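As a rough illustration of the host-side option, here is a minimal sketch in plain C++ (it should also compile as host code in a CUDA source file). It assumes, hypothetically, that each thread wrote its MAD to a unique slot in a row-major output buffer, for example out[idy * n + idx], and that this buffer has already been copied back into a std::vector on the host. The names Match and find_best_match are made up for this example and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type: the lowest MAD and the map cell where it occurred.
struct Match {
    float mad;
    int col;
    int row;
};

// Host-side reduction over per-thread results.
// Assumed layout: mad[row * n + col] holds the MAD computed by the thread
// with idx == col and idy == row, where n is the map width in columns.
Match find_best_match(const std::vector<float>& mad, int n)
{
    assert(n > 0);
    const std::size_t width = static_cast<std::size_t>(n);

    Match best{std::numeric_limits<float>::max(), -1, -1};
    for (std::size_t i = 0; i < mad.size(); ++i) {
        // Strict '<' keeps the first minimum in row-major order,
        // which is the same tie-breaking as the single-threaded loops.
        if (mad[i] < best.mad) {
            best.mad = mad[i];
            best.col = static_cast<int>(i % width);  // recover column from flat index
            best.row = static_cast<int>(i / width);  // recover row from flat index
        }
    }
    return best;
}
```

Because the position is recovered from the flat index, the kernel would only need to store one float per thread under this layout. For large maps a device-side parallel reduction is likely to be faster, since this approach copies all m * n values back to the host first.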
