
I'm writing a program in OpenCL that takes two arrays of points and computes the nearest neighbor for each point.

I have two kernels. One of them computes the distance in 4 dimensions, the other in 6. They are below:

4 dimensions:

kernel void BruteForce(
    const global float4* m, // read_only/write_only apply to images only; const marks the buffer immutable
    global float4* y,
    global ushort* i,
    const uint mx)
{
    int index = get_global_id(0);
    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = USHRT_MAX; // sentinel: "not found"
    int mmx = mx;
    for (int x = 0; x < mmx; x++)
    {
        float dist = fast_distance(curY, m[x]);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x;
        }
    }
    i[index] = minIdx;
    y[index] = minDist; // scalar is replicated across all four components
}

6 dimensions:

kernel void BruteForce(
    const global float8* m,
    global float8* y,
    global ushort* i,
    const uint mx)
{
    int index = get_global_id(0);
    float8 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = USHRT_MAX; // sentinel: "not found"
    int mmx = mx;
    for (int x = 0; x < mmx; x++)
    {
        float8 p = m[x]; // renamed from "mx", which shadowed the parameter
        float d0 = p.s0 - curY.s0;
        float d1 = p.s1 - curY.s1;
        float d2 = p.s2 - curY.s2;
        float d3 = p.s3 - curY.s3;
        float d4 = p.s4 - curY.s4;
        float d5 = p.s5 - curY.s5;

        float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
        if (dist < minDist)
        {
            minDist = dist;
            minIdx = x; // was "index", which recorded the query point's own id
        }
    }
    i[index] = minIdx;
    y[index] = minDist; // scalar is replicated across all eight components
}

I'm looking for ways to optimize this program for the GPU. I've read some articles on using local memory to optimize GPGPU code (including http://www.macresearch.org/opencl_episode6, which comes with source code). I've tried applying it and came up with this code:

kernel void BruteForce(
    const global float4* m,
    global float4* y,
    global ushort* i,
    __local float4* shared)
{
    int index = get_global_id(0);
    int lsize = get_local_size(0);
    int lid = get_local_id(0);

    float4 curY = y[index];

    float minDist = MAXFLOAT;
    ushort minIdx = 64000; // sentinel: "not found"
    // {0} is a host-side placeholder: the point count is substituted into
    // the source with string formatting before the program is built
    for (int x = 0; x < {0}; x += lsize)
    {
        if ((x + lsize) > {0})
            lsize = {0} - x;
        if ((x + lid) < {0})
        {
            shared[lid] = m[x + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int x1 = 0; x1 < lsize; x1++)
        {
            float dist = distance(curY, shared[x1]);

            if (dist < minDist)
            {
                minDist = dist;
                minIdx = x + x1;
            }
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    i[index] = minIdx;
    y[index] = minDist;
}

I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'd appreciate any answer that helps me improve this code, or finds the problem with the optimized version above.

Many thanks, Cauê

1 Answer

One way to get a big speedup here is to use local data structures and compute entire blocks at a time. You should also need only a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you, because you know the specifics of your application.

Some pseudocode (4d):

computeBlockSize is the size of the blocks to read from global and crunch.
This value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. This will allocate
2 * float4 * computeBlockSize + uint * computeBlockSize bytes of shared memory.
(max value for ocl 1.0 ~448, ocl 1.1 ~896)
#define computeBlockSize 256

__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];

now blockA gets computed against all blockB combinations. this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global

steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed


need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j = get_local_id(0); j < computeBlockSize; j += get_local_size(0))
  for(k = 0; k < computeBlockSize; k++)
    if(j == k) continue; //no self-comparison
    calculate distance of blockA[j] vs blockA[k]
    store min distance in blockA[j].w
    store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(CLK_LOCAL_MEM_FENCE)

for (i = 0; i < get_num_groups(0); i++)
  if (i == get_group_id(0)) continue;
  load blockB into local memory: async_work_group_copy(...)
  for(j = get_local_id(0); j < computeBlockSize; j += get_local_size(0))
    for(k = 0; k < computeBlockSize; k++)
      calculate distance of blockA[j] vs blockB[k]
      store min distance in blockA[j].w
      store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j]
  barrier(CLK_LOCAL_MEM_FENCE)

write blockA and blockAnearestIndex to global memory using two async_work_group_copy
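
A minimal concrete rendering of this pseudocode is sketched below. It is untested, and several choices are mine rather than the answer's: the kernel and buffer names, the dist3 helper, and the assumptions that the point count is exactly numBlocks * computeBlockSize, that the work group size divides computeBlockSize, and that the kernel is launched with one work group per block (global size = numBlocks * work group size). Note that storing the running minimum in .w, as the pseudocode does, implies distances are computed on xyz only; true 4-component points would need a separate local array for the distances.

#define computeBlockSize 256

// Distance on xyz only; the w component is reserved for the running minimum.
float dist3(float4 a, float4 b)
{
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

kernel void NearestNeighborBlocks(
    global float4* points,     // in: xyz positions; out: w = min distance
    global uint* nearestIndex) // out: global index of the nearest point
{
    local float4 blockA[computeBlockSize];
    local float4 blockB[computeBlockSize];
    local uint blockAnearestIndex[computeBlockSize];

    uint base = get_group_id(0) * computeBlockSize;

    // Load this group's own block into local memory.
    event_t e = async_work_group_copy(blockA, points + base, computeBlockSize, 0);
    wait_group_events(1, &e);

    // Pass 1: blockA against itself. Each work item owns a strided set of
    // j's, so only this work item ever writes blockA[j].
    for (uint j = get_local_id(0); j < computeBlockSize; j += get_local_size(0))
    {
        float minDist = MAXFLOAT;
        uint minIdx = base + j;
        for (uint k = 0; k < computeBlockSize; k++)
        {
            if (j == k) continue; // no self-comparison
            float d = dist3(blockA[j], blockA[k]);
            if (d < minDist) { minDist = d; minIdx = base + k; }
        }
        blockA[j].w = minDist;
        blockAnearestIndex[j] = minIdx;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Pass 2: blockA against every other group's block.
    for (uint i = 0; i < get_num_groups(0); i++)
    {
        if (i == get_group_id(0)) continue;
        e = async_work_group_copy(blockB, points + i * computeBlockSize,
                                  computeBlockSize, 0);
        wait_group_events(1, &e);

        for (uint j = get_local_id(0); j < computeBlockSize; j += get_local_size(0))
        {
            float minDist = blockA[j].w;
            uint minIdx = blockAnearestIndex[j];
            for (uint k = 0; k < computeBlockSize; k++)
            {
                float d = dist3(blockA[j], blockB[k]);
                if (d < minDist) { minDist = d; minIdx = i * computeBlockSize + k; }
            }
            blockA[j].w = minDist;
            blockAnearestIndex[j] = minIdx;
        }
        // Everyone must be done reading blockB before it is reloaded.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write results back with two async copies.
    event_t ev[2];
    ev[0] = async_work_group_copy(points + base, blockA, computeBlockSize, 0);
    ev[1] = async_work_group_copy(nearestIndex + base, blockAnearestIndex,
                                  computeBlockSize, 0);
    wait_group_events(2, ev);
}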

There should be no problem with reading a blockB while another work group is writing to the same block (as its own blockA), since only the W values may have changed. If you happen to run into trouble with this, or if you really do need two different vectors of points, you could use two global vectors like you have above: one with the A's (writeable) and the other with the B's (read only).
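
A hedged sketch of that two-vector variant, relative to the NearestNeighborBlocks sketch above (the names and the bBlocks parameter are my own):

kernel void NearestNeighborBlocksAB(
    global float4* a,           // writable set: w receives the min distance
    const global float4* b,     // read-only set that is searched
    global uint* aNearestIndex,
    uint bBlocks)               // number of computeBlockSize blocks in b
{
    // Same body as NearestNeighborBlocks, with three changes:
    //  - blockA is copied from a + base, blockB from b + i * computeBlockSize
    //  - pass 1 (blockA against itself, with its j == k test) is dropped,
    //    since a point of a is never a member of b
    //  - pass 2 runs for (uint i = 0; i < bBlocks; i++) with no
    //    "skip my own group" test
}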

This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks, in a way similar to the one above. The new kernel can execute after the first one, and you would save the second pci-e transfer. Alternatively, you can use a distance of -1 to signify a skip when comparing two elements (i.e. if blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommended writing a new kernel instead. Only a very small fraction of your data points will actually fall in an edge block.
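
If you go with the second solution instead, the inner k loop of the NearestNeighborBlocks sketch would gain a guard along these lines (illustrative only: it assumes the host pads the global vector up to a multiple of computeBlockSize and writes w = -1.0f into the padding entries, with a matching if (blockA[j].w == -1.0f) continue; at the top of the j loop so padding rows are skipped too):

for (uint k = 0; k < computeBlockSize; k++)
{
    // Padding entry flagged by the host: skip it. Real entries always end
    // up with w >= 0, so -1 stays distinguishable.
    if (blockB[k].w == -1.0f) continue;
    float d = dist3(blockA[j], blockB[k]);
    if (d < minDist) { minDist = d; minIdx = i * computeBlockSize + k; }
}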

Answered 2012-10-23T13:54:55.390