java - OpenCL: how to rebuild a buffer when using multiple devices?

Question

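I am teaching myself OpenCL in Java using the jogamp JOCL library. One of my tests creates a Mandelbrot map. I have four variants: simple serial, parallel using the Java executor interface, OpenCL on a single device, and OpenCL on multiple devices. The first three work; the last one does not. When I compare the (correct) single-device output with the incorrect multi-device output, the colors are roughly the same, but the multi-device image is garbled. I think I know where the problem is, but I cannot solve it.

The problem (imho) is that OpenCL works with flat vector buffers and I have to translate the output into a matrix. I think this translation is incorrect. I parallelize the code by dividing the Mandelbrot map into rectangles whose width is xSize divided by the number of tasks and whose height (ySize) is preserved. I believe I pass that information into the kernel correctly, but translating it back goes wrong.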

  CLMultiContext mc = CLMultiContext.create (deviceList);
  try 
  {
     CLSimpleContextFactory factory = CLQueueContextFactory.createSimple (programSource);
     CLCommandQueuePool<CLSimpleQueueContext> pool = CLCommandQueuePool.create (factory, mc);
     IntBuffer dataC = Buffers.newDirectIntBuffer (xSize * ySize);
     IntBuffer subBufferC = null;
     int tasksPerQueue = 16;
     int taskCount = pool.getSize () * tasksPerQueue;
     int sliceWidth = xSize / taskCount;
     int sliceSize = sliceWidth * ySize;
     int bufferSize = sliceSize * taskCount;
     double sliceX = (pXMax - pXMin) / (double) taskCount;
     String kernelName = "Mandelbrot";

     out.println ("sliceSize: " + sliceSize);
     out.println ("sliceWidth: " + sliceWidth);
     out.println ("sS*h:" + sliceWidth * ySize);
     List<CLTestTask> tasks = new ArrayList<CLTestTask> (taskCount);

     for (int i = 0; i < taskCount; i++) 
     {
        subBufferC = Buffers.slice (dataC, i * sliceSize, sliceSize);
        tasks.add (new CLTestTask (kernelName, i, sliceWidth, xSize, ySize, maxIterations, 
              pXMin + i * sliceX, pYMin, xStep, yStep, subBufferC));
     } // for

     pool.invokeAll (tasks);

     // submit blocking immediately
     for (CLTestTask task: tasks) pool.submit (task).get ();

     // Ready read the buffer into the frequencies matrix
     // according to me this is the part that goes wrong
     int w = taskCount * sliceWidth;
     for (int tc = 0; tc < taskCount; tc++)
     {
        int offset = tc * sliceWidth;

        for (int y = 0; y < ySize; y++)
        {
           for (int x = offset; x < offset + sliceWidth; x++)
           {
              frequencies [y][x] = dataC.get (y * w + x);
           } // for
        } // for
     } // for

     pool.release();

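The last loop is the culprit, which means (I think) that there is a mismatch between how the kernel encodes the data and how the host translates it back. The kernel: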

kernel void Mandelbrot 
(
   const int width,        
   const int height,
   const int maxIterations,
   const double x0,      
   const double y0,
   const double stepX,  
   const double stepY,
   global int *output   
) 
{
    unsigned ix = get_global_id (0);
    unsigned iy = get_global_id (1);

    if (ix >= width) return;
    if (iy >= height) return;

    double r = x0 + ix * stepX;
    double i = y0 + iy * stepY;

    double x = 0;
    double y = 0;

    double magnitudeSquared = 0;
    int iteration = 0;

    while (magnitudeSquared < 4 && iteration < maxIterations) 
    {
        double x2 = x*x;
        double y2 = y*y;
        y = 2 * x * y + i;
        x = x2 - y2 + r;
        magnitudeSquared = x2+y2;
        iteration++;
    }

    output [iy * width + ix] = iteration;
}

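The last statement encodes the result into the flat vector. The single-device version uses this kernel as well. The only difference is that in the multi-device version I changed width and x0. As you can see in the Java code, I pass xSize / number_of_tasks as the width and pXMin + i * sliceX as x0 (instead of pXMin).

I have been working on this for several days now and have removed quite a few bugs, but I cannot see what I am doing wrong at this point. Any help is greatly appreciated.

Edit 1

@Huseyin asked for images. The first screenshot was computed by the OpenCL single-device version.

The second screenshot is the multi-device version, computed with exactly the same parameters.

Edit 2

There was a question about how I enqueue the buffers. As you can see in the code above, I have a List<CLTestTask> to which I add the tasks, and the buffer is enqueued inside each task. CLTestTask is an inner class; its code is below.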

final class CLTestTask implements CLTask
{
  CLBuffer clBufferC = null;
  Buffer bufferSliceC;
  String kernelName;
  int index;
  int sliceWidth;
  int width;
  int height;
  int maxIterations;
  double pXMin;
  double pYMin;
  double x_step;
  double y_step;

  public CLTestTask 
  (
        String kernelName, 
        int index,
        int sliceWidth,
        int width, 
        int height,
        int maxIterations,
        double pXMin,
        double pYMin,
        double x_step,
        double y_step,
        Buffer bufferSliceC
  )
  {
     this.index = index;
     this.sliceWidth = sliceWidth;
     this.width = width;
     this.height = height;
     this.maxIterations = maxIterations;
     this.pXMin = pXMin;
     this.pYMin = pYMin;
     this.x_step = x_step;
     this.y_step = y_step;
     this.kernelName = kernelName;
     this.bufferSliceC = bufferSliceC;
  } /*** CLTestTask ***/

  public Buffer execute (final CLSimpleQueueContext qc) 
  {
     final CLCommandQueue queue = qc.getQueue ();
     final CLContext context = qc.getCLContext ();
     final CLKernel kernel = qc.getKernel (kernelName);
     clBufferC = context.createBuffer (bufferSliceC);

     out.println (pXMin + " " + sliceWidth);
     kernel
     .putArg (sliceWidth)
     .putArg (height)
     .putArg (maxIterations)
     .putArg (pXMin) // + index * x_step)
     .putArg (pYMin)
     .putArg (x_step)
     .putArg (y_step)
     .putArg (clBufferC)
     .rewind ();

     queue
     .put2DRangeKernel (kernel, 0, 0, sliceWidth, height, 0, 0)
     .putReadBuffer (clBufferC, true);

     return clBufferC.getBuffer ();
  } /*** execute ***/
} /*** Inner Class: CLTestTask ***/

score 2 · Accepted Answer

You are creating sub-buffers with

subBufferC = Buffers.slice (dataC, i * sliceSize, sliceSize);

and is their data laid out in memory like this:

1 2 3  10 11 12  19 20 21  28 29 30
4 5 6  13 14 15  22 23 24  31 32 33
7 8 9  16 17 18  25 26 27  34 35 36

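by using OpenCL's rectangular copy commands? If so, then you are accessing them out of bounds with

output [iy * width + ix] = iteration;

because width is larger than sliceWidth, so the kernel writes past the end of the slice.

If you are not doing rectangular copies or real sub-buffers, and are just taking an offset into the original buffer, then the memory layout looks like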

 1  2  3  4  5  6  7  8  9 | 10 11 12
 13 14 15 16 17 18|19 20 21  22 23 24
 25 26 27|28 29 30 31 32 33  34 35 36

so the array is accessed/interpreted as skewed, or the values are computed for the wrong positions.
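For the case where every task's slice ends up contiguous in the host buffer (which is what `Buffers.slice (dataC, i * sliceSize, sliceSize)` together with a kernel width of `sliceWidth` would give), the host has to read each slice back with the slice's own row stride rather than the full image width. A minimal sketch in plain Java, assuming xSize divides evenly by the task count; the class and method names are made up for illustration and nothing here calls into JOCL:

```java
import java.nio.IntBuffer;
import java.util.Arrays;

final class SliceReassembly {

    /**
     * Copies taskCount contiguous slices, each stored row-major with a row
     * length of sliceWidth, into a [ySize][taskCount * sliceWidth] matrix.
     */
    static int[][] reassemble(IntBuffer data, int taskCount, int sliceWidth, int ySize) {
        int sliceSize = sliceWidth * ySize;
        int[][] matrix = new int[ySize][taskCount * sliceWidth];
        for (int tc = 0; tc < taskCount; tc++) {
            int sliceBase = tc * sliceSize;   // start of slice tc in the flat buffer
            int columnBase = tc * sliceWidth; // first matrix column owned by slice tc
            for (int y = 0; y < ySize; y++) {
                for (int x = 0; x < sliceWidth; x++) {
                    // Inside a slice the row stride is sliceWidth, not the image width.
                    matrix[y][columnBase + x] = data.get(sliceBase + y * sliceWidth + x);
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        int taskCount = 4, sliceWidth = 3, ySize = 3;
        IntBuffer data = IntBuffer.allocate(taskCount * sliceWidth * ySize);
        // Fill the buffer with 1..36 in memory order.
        for (int k = 0; k < data.capacity(); k++) {
            data.put(k, k + 1);
        }
        for (int[] row : reassemble(data, taskCount, sliceWidth, ySize)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With 4 slices of 3 x 3 values, the first matrix row comes out as 1, 2, 3, 10, 11, 12, 19, 20, 21, 28, 29, 30. The loop in the question instead reads `y * w + x`, which only matches a buffer that holds the whole image in a single row-major block.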

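You are passing the offset as kernel arguments, but you can also do that through the kernel enqueue parameters. The global ids i and j then start from their true values (instead of zero), and you no longer need to add x0 or y0 for every thread inside the kernel.

I wrote a multi-device API a while ago. It uses multiple buffers, one per device, each the same size as the main buffer. Every device just copies the necessary part (its own territory) to or from the main (host) buffer, so the kernel computation stays exactly the same on all devices and only the global range offset differs. The downside is that the main buffer is literally duplicated on every device: with 4 GPUs and 1 GB of data you need 4 GB of buffer space in total. In return, the kernel stays much easier to read no matter how many devices are used.

If you allocate only a 1/N-sized buffer per device (for N devices), you need to copy from address 0 of each sub-buffer to i*sliceHeight in the main buffer, where i is the device index. Since the arrays are flat, that requires OpenCL's rectangular buffer copy commands for each device. I suspect you are also using flat arrays, with rectangular copies, and overflowing out of bounds in the kernel. In that case I suggest: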

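Remove any device-related offsets and parameters from the kernel
Put the necessary offsets into the kernel enqueue parameters instead of the kernel arguments
Duplicate the main buffer on each device, if you have not done so already
Copy only the parts that belong to each device (a contiguous copy if the flat array is divided linearly, a rectangular copy if the array is interpreted/divided in 2D)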
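The suggestions above can be simulated on the CPU to see why the indexing works out. The sketch below is plain Java with no OpenCL calls: `runRange` stands in for one kernel enqueue with a global offset, every simulated device gets a full-size buffer, and only that device's band of rows is copied into the master buffer. All names are hypothetical, and the row-wise split assumes the height divides evenly by the device count. In the question's code the offsets would presumably go into the two arguments right after the kernel in `put2DRangeKernel`, which are currently 0, 0.

```java
import java.util.Arrays;

final class GlobalOffsetSketch {

    /** Same escape-time loop as the question's kernel body. */
    static int iterations(double r, double i, int maxIterations) {
        double x = 0, y = 0, magnitudeSquared = 0;
        int iteration = 0;
        while (magnitudeSquared < 4 && iteration < maxIterations) {
            double x2 = x * x, y2 = y * y;
            y = 2 * x * y + i;
            x = x2 - y2 + r;
            magnitudeSquared = x2 + y2;
            iteration++;
        }
        return iteration;
    }

    /** Simulates one enqueue: the row ids start at offsetY, not at zero. */
    static void runRange(int[] deviceBuffer, int width, int offsetY, int rows,
                         double x0, double y0, double stepX, double stepY,
                         int maxIterations) {
        for (int iy = offsetY; iy < offsetY + rows; iy++) {
            for (int ix = 0; ix < width; ix++) {
                // Identical indexing and arguments on every device.
                deviceBuffer[iy * width + ix] =
                        iterations(x0 + ix * stepX, y0 + iy * stepY, maxIterations);
            }
        }
    }

    static int[] render(int width, int height, int devices,
                        double x0, double y0, double stepX, double stepY,
                        int maxIterations) {
        int[] master = new int[width * height];
        int rowsPerDevice = height / devices; // assumes an even split
        for (int d = 0; d < devices; d++) {
            int[] deviceBuffer = new int[width * height]; // full-size copy per device
            int offsetY = d * rowsPerDevice;
            runRange(deviceBuffer, width, offsetY, rowsPerDevice,
                     x0, y0, stepX, stepY, maxIterations);
            // A band of rows is contiguous in a flat row-major array,
            // so one linear copy of the device's own territory is enough.
            System.arraycopy(deviceBuffer, offsetY * width,
                             master, offsetY * width, rowsPerDevice * width);
        }
        return master;
    }

    public static void main(String[] args) {
        int[] one = render(8, 8, 1, -2.0, -2.0, 0.5, 0.5, 50);
        int[] four = render(8, 8, 4, -2.0, -2.0, 0.5, 0.5, 50);
        System.out.println("identical: " + Arrays.equals(one, four));
    }
}
```

Splitting by rows keeps each device's territory contiguous, which is why a linear copy suffices here; splitting by columns, as in the question, would need a strided (rectangular) copy instead.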

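If the whole data set does not fit on a device, you can try mapping/unmapping so that less is allocated in the background. The documentation page for mapping says:

Multiple command-queues can map a region or overlapping regions of a memory object for reading (i.e. map_flags = CL_MAP_READ). The contents of the regions of a memory object mapped for reading can also be read by kernels executing on a device(s). The behavior of writes by a kernel executing on a device to a mapped region of a memory object is undefined. Mapping (and unmapping) overlapped regions of a buffer or image memory object for writing is undefined.

It does not say "non-overlapping mappings for read/write are undefined", so you should be fine having a mapping on each device for concurrent reads/writes on the target buffer. However, when this is combined with the USE_HOST_PTR flag (for maximum streaming performance), each sub-buffer may need to start at an aligned pointer, which can make it harder to split the region into proper chunks. I use the same whole data array for all devices, so dividing the work is not a problem, because I can map/unmap any address within an aligned buffer.

Here is a 2-device result with a 1D division (upper part by the CPU, lower part by the GPU):

And this is the inside of the kernel: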

        unsigned ix = get_global_id (0)%w2;
        unsigned iy = get_global_id (0)/w2;

        if (ix >= w2) return;
        if (iy >= h2) return;

        double r = ix * 0.001;
        double i = iy * 0.001;

        double x = 0;
        double y = 0;

        double magnitudeSquared = 0;
        int iteration = 0;

        while (magnitudeSquared < 4 && iteration < 255) 
        {
            double x2 = x*x;
            double y2 = y*y;
            y = 2 * x * y + i;
            x = x2 - y2 + r;
            magnitudeSquared = x2+y2;
            iteration++;
        }

        b[(iy * w2 + ix)]   =(uchar4)(iteration/5.0,iteration/5.0,iteration/5.0,244);

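With an FX8150 (7 cores at 3.7 GHz) + R7_240 at 700 MHz, a 512x512 image (8 bits per channel + alpha) took 17 ms.

Having sub-buffers the same size as the host buffer also makes it faster (no reallocation) to use dynamic ranges instead of static ones (for heterogeneous setups, dynamic turbo frequencies, and hiccups/throttling), which helps with dynamic load balancing. Combined with the power of "same code, same parameters", this does not incur a performance penalty. For example, if every kernel started from zero, c[i]=a[i]+b[i] would have to become c[i+i0]=a[i+i0]+b[i+i0] to work on multiple devices, and that would add more cycles (on top of the memory bottleneck, the loss of readability, and the weirdness of distributing c=a+b).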
