c# - How can I reduce buffer-creation overhead in OpenCL/Cloo (C#)?

Question

I am using OpenCL through the C# Cloo interface, and I am running into some very frustrating problems while trying to get it to perform well in our product.

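Our product is a computer vision product that receives a 512x424 grid of pixel values from our camera 30 times per second. We want to run computations on those pixels to generate a point cloud relative to certain objects in the scene.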

To compute on these pixels, I do the following each time a new frame arrives (that is, on every frame):

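1) Create a CommandQueue, 2) create a read-only buffer for the input pixel values, 3) create a write-only, zero-copy buffer for the output point values, 4) pass in the matrices used for the computation on the GPU, 5) execute the kernel and wait for the response.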

An example of the per-frame work:

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        Point3D[] realWorldPoints = points.Get(Perspective.RealWorld).Points;
        ComputeBuffer<Point3D> realPointsBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
            realWorldPoints);
        _kernel.SetMemoryArgument(0, realPointsBuffer);

        Point3D[] toPopulate = new Point3D[realWorldPoints.Length];
        PointSet pointSet = points.Get(perspective);

        ComputeBuffer<Point3D> resultBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.UseHostPointer,
            toPopulate);
        _kernel.SetMemoryArgument(1, resultBuffer);
            float[] M = new float[3 * 3];
            ReferenceFrame referenceFrame =
                perspectives.ReferenceFrames[(int)Perspective.Floor];
            AffineTransformation transform = referenceFrame.ToReferenceFrame;
            M[0] = transform.M00;
            M[1] = transform.M01;
            M[2] = transform.M02;
            M[3] = transform.M10;
            M[4] = transform.M11;
            M[5] = transform.M12;
            M[6] = transform.M20;
            M[7] = transform.M21;
            M[8] = transform.M22;

            ComputeBuffer<float> Mbuffer = new ComputeBuffer<float>(_context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
                M);
            _kernel.SetMemoryArgument(2, Mbuffer);

            float[] b = new float[3];
            b[0] = transform.b0;
            b[1] = transform.b1;
            b[2] = transform.b2;

            ComputeBuffer<float> Bbuffer = new ComputeBuffer<float>(_context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
                b);
            _kernel.SetMemoryArgument(3, Bbuffer);

            _kernel.SetValueArgument<int>(4, (int)Perspective.Floor);

            //sw.Start();

            commandQueue.Execute(_kernel,
                new long[] { 0 }, new long[] { toPopulate.Length }, null, null);
            IntPtr retPtr = commandQueue.Map(
                resultBuffer,
                true,
                ComputeMemoryMappingFlags.Read,
                0,
                toPopulate.Length, null);

            commandQueue.Unmap(resultBuffer, ref retPtr, null);

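When profiled, this takes far too long: 90% of the time is spent creating all of the ComputeBuffer objects and so on. The actual computation on the GPU is about as fast as it could be.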

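My question is: how do I fix this? The incoming pixel array is different for every frame, so I have to create a new ComputeBuffer for it. Our matrices can also change periodically as we update the scene (again, I can't go into all the details). Is there a way to update these buffers on the GPU? I'm using an Intel GPGPU, so I have shared memory and should in theory be able to do this.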

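This is getting frustrating, because again and again the speed gains I find on the GPU are swamped by the overhead of setting everything up for every frame.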

Edit 1:

I don't think my original code sample showed what I am doing well enough, so I created a real working example and posted it on GitHub.

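For legacy and time reasons, I can't change much of the overarching architecture of our current product. I am trying to "drop in" GPU code in certain slow parts to speed them up. Given the limitations I am seeing, that may not be possible. Still, let me explain better what I am doing.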

I'll give the code, but I will be referring to the function "ComputePoints" in the class "GPUComputePoints".

As you can see in my ComputePoints function, each call is passed a CameraFrame along with the transformation matrices M and b.

public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)

These are new arrays generated by our pipeline, not arrays that I can keep lying around. So I create a new ComputeBuffer for each of them:

       ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
          ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
          frame.RawData);
        _kernel.SetMemoryArgument(0, inputBuffer);

        Point3D[] ret = new Point3D[frame.Width * frame.Height]; 
        ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(1, outputBuffer);

        ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            M);
        _kernel.SetMemoryArgument(2, mBuffer);

        ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            b);
         _kernel.SetMemoryArgument(3, bBuffer);

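...and this, I believe, is where the performance drain is. It was suggested that I use the map/unmap functionality to get around this. But I don't see how that would help, since I would still need to create the buffers every time to wrap the newly passed-in arrays, right?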

score 0 · Accepted Answer

"The incoming pixel array is different for every frame, so I have to create a new ComputeBuffer for it."

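You can create one big buffer and then use ranges of it for multiple different frames. Then you don't have to re-create (or release) anything on every frame.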

"Our matrices can also change periodically as we update the scene (again, I can't go into all the details)."

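Any buffer that has gone unused for N iterations/frames can be released. Whenever an existing buffer turns out to be too small, release it and create one twice as large, so that it can be re-used many more times before it has to be released again.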

If the number and order of the kernel arguments stay the same, there is no need to set them on every frame either.

"Is there a way to update these buffers on the GPU?"

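For OpenCL versions <= 1.2 (no shared virtual memory?), using device-side pointers on the host side or host-side pointers on the device side is not recommended.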

But it may work if it does not conflict with the video adapter or whatever is generating the video frames (and possibly with use_host_ptr).

There is no need to re-create the CommandQueue. Create it once and use it for all in-order work.

If you are re-creating all of this because of a software design like the following:

 float [] results = test(videoFeedData);

then you can try something like this instead:

float [] results = new float[n];
test(videoFeedData,results);

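That way it does not have to create everything on each call. Instead it takes the size of the result or input data, creates the OpenCL buffers once, caches them somewhere such as a map/dictionary, and re-uses them whenever an array of similar size comes in.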
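A minimal sketch of such a cache, assuming one buffer per kernel-argument role plus the "release after N unused frames" rule mentioned earlier. `BufferCache`, `Get`, `EndFrame` and the two delegates are hypothetical names for illustration, not part of Cloo. Plain `float[]` arrays stand in for device buffers so the sketch runs without a GPU; with Cloo, the `create` delegate would presumably wrap a `ComputeBuffer<T>` constructor call like the ones in the question, and the new frame's data would still have to be copied into the cached buffer each frame (not shown here).

```csharp
using System;
using System.Collections.Generic;

// Caches one buffer per role ("input", "output", ...) and re-uses it while it is big enough.
// TBuffer is a stand-in for a real device buffer type (assumption for this sketch).
public sealed class BufferCache<TBuffer>
{
    private sealed class Slot
    {
        public TBuffer Buffer;
        public long Capacity;
        public long LastUsedFrame;
    }

    private readonly Dictionary<string, Slot> _slots = new Dictionary<string, Slot>();
    private readonly Func<long, TBuffer> _create;  // allocates a buffer with room for N elements
    private readonly Action<TBuffer> _release;     // frees a buffer
    private readonly long _maxIdleFrames;
    private long _frame;

    public int Allocations { get; private set; }

    public BufferCache(Func<long, TBuffer> create, Action<TBuffer> release, long maxIdleFrames)
    {
        _create = create;
        _release = release;
        _maxIdleFrames = maxIdleFrames;
    }

    // Returns a buffer with room for at least `count` elements.
    public TBuffer Get(string role, long count)
    {
        Slot slot;
        if (_slots.TryGetValue(role, out slot) && slot.Capacity >= count)
        {
            slot.LastUsedFrame = _frame;           // cache hit: nothing is created
            return slot.Buffer;
        }
        if (slot != null) _release(slot.Buffer);   // cached buffer exists but is too small
        slot = new Slot { Buffer = _create(count), Capacity = count, LastUsedFrame = _frame };
        _slots[role] = slot;
        Allocations++;
        return slot.Buffer;
    }

    // Call once per frame: releases buffers that were not requested for too many frames.
    public void EndFrame()
    {
        _frame++;
        var idle = new List<string>();
        foreach (var pair in _slots)
            if (_frame - pair.Value.LastUsedFrame > _maxIdleFrames) idle.Add(pair.Key);
        foreach (string role in idle)
        {
            _release(_slots[role].Buffer);
            _slots.Remove(role);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var cache = new BufferCache<float[]>(n => new float[n], b => { }, 30);
        for (int frame = 0; frame < 100; frame++)
        {
            cache.Get("input", 512 * 424);
            cache.Get("output", 512 * 424);
            cache.EndFrame();
        }
        Console.WriteLine("allocations: " + cache.Allocations); // allocations: 2
    }
}
```

With frames of constant size, as in the question, the two buffers are created on the first frame only and every later frame is a cache hit.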

In practice it would work like this:

new frame feed-0: 1kB data ---> allocate 1kB
feed-1: 10 MB data ---> allocate 10 MB, delete 1kB one
feed-2: 3 MB data ---> re-use 10MB one
feed-3: 2 kB data ---> re-use 10MB 
feed-4: 100 MB data ---> delete 10MB, allocate 100MB
feed-5: 110 MB data ----> delete 100MB, allocate 200MB
feed-6: 120 MB data  ---> re-use 200 MB
feed-7: 150 MB data  ---> re-use 200 MB 
feed-8: 90 MB data  ---> re-use 200 MB

for both the input and the output data.
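The sizing rule behind that trace can be sketched as a small function. This is one reading of the trace, not a rule stated anywhere else: keep the buffer while it is big enough, otherwise allocate the larger of the requested size and twice the current capacity. `GrowthPolicy`, `NextCapacity` and `Simulate` are hypothetical names, sizes are in kB, and 1 MB is taken as 1000 kB to keep the numbers readable.

```csharp
using System;
using System.Collections.Generic;

public static class GrowthPolicy
{
    // Capacity to use for a request, given the current capacity (0 = nothing allocated yet).
    public static long NextCapacity(long currentKb, long requiredKb)
    {
        if (requiredKb <= currentKb) return currentKb;  // big enough: re-use
        return Math.Max(requiredKb, currentKb * 2);     // too small: grow to at least double
    }

    // Replays a sequence of per-frame sizes and logs what happens to the buffer.
    public static List<string> Simulate(long[] requestsKb)
    {
        var log = new List<string>();
        long capacity = 0;
        foreach (long request in requestsKb)
        {
            long next = NextCapacity(capacity, request);
            log.Add(next == capacity ? "re-use " + capacity : "allocate " + next);
            capacity = next;
        }
        return log;
    }
}

public static class Program
{
    public static void Main()
    {
        // feed-0 .. feed-8 from the trace above, in kB
        long[] feeds = { 1, 10000, 3000, 2, 100000, 110000, 120000, 150000, 90000 };
        foreach (string line in GrowthPolicy.Simulate(feeds))
            Console.WriteLine(line);
    }
}
```

Replaying the nine feeds gives four allocations (1 kB, 10 MB, 100 MB, 200 MB) and five re-uses, which matches the trace. An "allocate" after the first one implies releasing the previous buffer.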

Apart from the actual cost of re-creation, re-creating many things may also defeat the driver's optimizations and reset them.

Maybe something like this:

 CoresGpu gpu = new CoresGpu(kernelString,options,"gpu");

 for (int i = 0; i < 100; i++)
 {
   float [] results = new float[n];

   // allocate new, if only not enough, deallocate old, if only not used
   gpu.compute(new object[]{getVideoFeedBuffer(),brush21x21array,results},
             new string[]{"input","input","output"},
             kernelName,numberOfThreads);

   toCloudDb(results.toList());
 }

 gpu.release(); // everything is released here

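If re-creation is a must and there is no way around it, you can even use pipelining to hide the re-creation latency (though it will still be slower than the ideal case).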

push data
thread-0:get video feed

push data
thread-0:get next video feed
thread-1:send old video feed to gpu

push data
thread-0:get third video feed
thread-1:send second video feed to gpu
thread-2:compute on gpu

push data
thread-0:get fourth video feed
thread-1:send third video feed to gpu
thread-2:compute second frame on gpu
thread-3:get result of first frame from gpu to RAM

push data
thread-0:get fifth video feed
thread-1:send fourth video feed to gpu
thread-2:compute third frame on gpu
thread-3:get result of second frame from gpu to RAM
pop first data

...
...
pop second data

and keep going like this, with something like:

var result=gpu.pipeline.push(videoFeed);
if(result!=null)
{ result has been popped! }
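A simplified sketch of such a push/pop interface. Note that this is not the stage-per-thread pipeline drawn above: it overlaps whole frames by running each frame's work as a `Task`, which is easier to show in a few lines. `FramePipeline`, `Push` and `Flush` are hypothetical names, and the `process` delegate is assumed to be safe to run for several frames at once (for example, each in-flight frame using its own buffers).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Keeps up to `depth` frames in flight. Push returns null while the pipeline is filling,
// then returns the result of the oldest frame each time a new frame is pushed.
public sealed class FramePipeline<TIn, TOut> where TOut : class
{
    private readonly Queue<Task<TOut>> _inFlight = new Queue<Task<TOut>>();
    private readonly Func<TIn, TOut> _process;
    private readonly int _depth;

    public FramePipeline(Func<TIn, TOut> process, int depth)
    {
        _process = process;
        _depth = depth;
    }

    public TOut Push(TIn frame)
    {
        _inFlight.Enqueue(Task.Run(() => _process(frame)));
        if (_inFlight.Count <= _depth) return null;  // still filling: nothing to pop yet
        return _inFlight.Dequeue().Result;           // blocks only if the oldest frame is unfinished
    }

    // Returns the results still in flight, oldest first.
    public IEnumerable<TOut> Flush()
    {
        while (_inFlight.Count > 0) yield return _inFlight.Dequeue().Result;
    }
}

public static class Program
{
    public static void Main()
    {
        var pipeline = new FramePipeline<int, string>(id => "points for frame " + id, 2);
        for (int id = 1; id <= 4; id++)
        {
            string result = pipeline.Push(id);
            Console.WriteLine(result ?? "(pipeline still filling)");
        }
        foreach (string rest in pipeline.Flush())
            Console.WriteLine(rest);
    }
}
```

Results come back in the order the frames were pushed, delayed by `depth` pushes, which is the same latency-for-throughput trade the diagram above describes.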

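The compute, copy, video-feed and pop operations hide part of the re-creation latency. If re-creation takes 90% of the total time, pipelining hides only 10%. If it takes 50%, it hides the other 50%.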
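Those percentages follow from a simple idealized model, assuming the stages overlap perfectly: without pipelining a frame costs the sum of all stage times, while a full pipeline delivers one frame per slowest-stage time. `PipelineMath` and `HiddenFraction` are hypothetical names for this back-of-the-envelope check.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class PipelineMath
{
    // Fraction of the per-frame time that pipelining can hide in the idealized model:
    // sequential cost = sum of stage times, pipelined steady-state cost = slowest stage.
    public static double HiddenFraction(params double[] stageTimes)
    {
        double sequential = stageTimes.Sum();
        double pipelined = stageTimes.Max();
        return 1.0 - pipelined / sequential;
    }
}

public static class Program
{
    public static void Main()
    {
        // Re-creation is 90% of the frame time, everything else 10%.
        Console.WriteLine(PipelineMath.HiddenFraction(90, 10).ToString("F2", CultureInfo.InvariantCulture)); // 0.10
        // Re-creation is 50%, everything else 50%.
        Console.WriteLine(PipelineMath.HiddenFraction(50, 50).ToString("F2", CultureInfo.InvariantCulture)); // 0.50
    }
}
```

The slowest stage sets the frame rate, so when re-creation dominates, pipelining cannot help much and avoiding the re-creation is the bigger win.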

"5) Execute the kernel and wait for the response."

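Why wait? Do the frames depend on each other? If not, you can also use multiple pipelines. Then you can re-create many buffers concurrently, one set per pipeline, so more work gets done, but a lot of cycles are still wasted. Using big buffers for everything is probably the fastest option.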
