
I have a C# project in which I grab grayscale images from a camera and run some calculations on the image data. The calculations are very time-consuming, since I have to loop over the whole image several times, and I do all of it on the CPU.

Now I would like to try running the evaluation on the GPU, but I am struggling with that because I have never done any GPU computing before.

The software has to run on several machines with different hardware, so CUDA, for example, is not a solution for me, because the code should also run on laptops that only have onboard graphics. After some research I came across Cloo (found it on this project), which seems like a quite reasonable choice.

So far I have integrated Cloo into my project and tried to get this hello world example running. I guess it is running, since I don't get any exceptions, but I don't know where I can see the printed output.

For my calculations I need to pass the image to the GPU, and I also need the x-y coordinates during the calculation. So, in C#, the calculation looks like this:

int a = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        a += image[x,y] * x * y;
    }
}

int b = 0;
for (int y = 0; y < img_height; y++){
    for (int x = 0; x < img_width; x++){
        b += image[x,y] * (x-a) * y;
    }
}

Now I want these calculations to run on the GPU, and I want to parallelise the y-loop, so that each task runs one x-loop. Then I could take all the resulting a values and add them up before the second loop block starts.

Afterwards I want to return the values a and b to my C# code and use them there.

So, to sum up my questions:

  1. Is Cloo a recommendable choice for this task?
  2. What is the best way to pass the image data (16 bit, a short array) and the dimensions (img_width, img_height) to the GPU?
  3. How do I get values back from the GPU? As far as I know, kernels are always used as kernel void...
  4. What is the best way to implement the loops?

I hope my question is clear and that I have provided enough information to understand my struggle. Any help is appreciated. Thanks in advance.


1 Answer


Let's reverse-engineer the problem and look at how to process the "dependency-chain" of image[x,y], img_height, img_width, a, b efficiently.


Ad 4) The tandem of identical for-loops has poor performance

Given the code as defined, there could be just a single loop, which reduces overhead costs and gives the best chance of cache-aligned, vectorised code.

A cache-naive re-formulation, fusing both passes into one:

int a = 0;
int c = 1;                                         // will end up as ( 1 - SUM image[x,y] * y )

for (     int  y = 0; y < img_height; y++ ){
    for ( int  x = 0; x < img_width;  x++ ){
          int      intermediate = image[x,y] * y; // .SET   PROD( i[x,y], y )
          a += x * intermediate;                  // .REUSE 1st: a = SUM image[x,y] * x * y
          c -=     intermediate;                  // .REUSE 2nd: c = 1 - SUM image[x,y] * y
    }
}
int b = a * c; // b = SUM image[x,y] * ( x - a ) * y  =  a - a * SUM image[x,y] * y  =  a * c

Splitting the code into the tandem loops only increases these overheads and destroys any cache-friendly tricks that could be used when tweaking the code's performance.
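
Before reaching for the GPU at all, the same fused pass can also be split across CPU cores. A minimal sketch, assuming the asker's image / img_width / img_height variables; the method name ComputeAB is made up for illustration, and the long accumulators are a choice of this sketch (to sidestep int overflow), not of the original code:

using System.Threading;
using System.Threading.Tasks;

static (long a, long b) ComputeAB(short[,] image, int img_width, int img_height)
{
    long aTotal = 0, sTotal = 0;

    Parallel.For(0, img_height, y =>
    {
        long aRow = 0, sRow = 0;             // per-row partial sums, nothing shared inside the loop
        for (int x = 0; x < img_width; x++)
        {
            long intermediate = (long)image[x, y] * y;
            aRow += x * intermediate;        // contributes to a = SUM image * x * y
            sRow += intermediate;            // contributes to S = SUM image * y
        }
        Interlocked.Add(ref aTotal, aRow);   // one atomic add per row keeps contention negligible
        Interlocked.Add(ref sTotal, sRow);
    });

    long a = aTotal;
    long b = a * (1 - sTotal);               // same identity as above: b = a * c, with c = 1 - S
    return (a, b);
}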


Ad 3 + 2) The kernel call-signature and the CPU-side methods allow this

OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.

Yet there are latency costs associated with each such host-to-device and device-to-host transfer. Given that you state the 16-bit 1920x1200 image data are to be re-processed ~10 times in a loop, there is a fair chance these latencies need not be paid on every pass through that loop.
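
To make questions 2 and 3 concrete, here is a rough sketch of that documented path, assuming Cloo's ComputeBuffer / ComputeProgram / ComputeKernel / ComputeCommandQueue API; the kernel name row_sums, the per-row partial-sum scheme and the long accumulators are illustrative assumptions, and the buffers are created once so they can be reused across the ~10 passes:

using System;
using Cloo;

// OpenCL C source, kept as a C# string. One work-item per image row; each
// work-item writes its per-row partial sums, which are summed on the host.
const string source = @"
kernel void row_sums(global const short* image,
                     int width,
                     global long* rowA,    // per-row SUM image * x * y
                     global long* rowS)    // per-row SUM image * y
{
    int y = get_global_id(0);
    long a = 0, s = 0;
    for (int x = 0; x < width; x++)
    {
        long v = (long)image[y * width + x] * y;
        a += x * v;
        s += v;
    }
    rowA[y] = a;
    rowS[y] = s;
}";

int width = 1920, height = 1200;
short[] image = new short[width * height];   // filled from the camera in the real project

var platform = ComputePlatform.Platforms[0];
var context  = new ComputeContext(ComputeDeviceTypes.Gpu,
                                  new ComputeContextPropertyList(platform), null, IntPtr.Zero);
var queue    = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

var program  = new ComputeProgram(context, source);
program.Build(null, null, null, IntPtr.Zero);
var kernel   = program.CreateKernel("row_sums");

// create buffers once, outside the re-processing loop
var imgBuf  = new ComputeBuffer<short>(context,
                  ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, image);
var rowABuf = new ComputeBuffer<long>(context, ComputeMemoryFlags.WriteOnly, height);
var rowSBuf = new ComputeBuffer<long>(context, ComputeMemoryFlags.WriteOnly, height);

kernel.SetMemoryArgument(0, imgBuf);
kernel.SetValueArgument(1, width);
kernel.SetMemoryArgument(2, rowABuf);
kernel.SetMemoryArgument(3, rowSBuf);

queue.Execute(kernel, null, new long[] { height }, null, null);   // one work-item per row

var rowA = new long[height];
var rowS = new long[height];
queue.ReadFromBuffer(rowABuf, ref rowA, true, null);               // blocking D2H read-back
queue.ReadFromBuffer(rowSBuf, ref rowS, true, null);

long a = 0, s = 0;
for (int y = 0; y < height; y++) { a += rowA[y]; s += rowS[y]; }
long b = a * (1 - s);                                              // same identity as the fused CPU version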

The worst performance killer is the very shallow mathematical density of the kernel. The problem is that there is simply not much to calculate per pixel, so the chances for any efficient SIMD / GPU parallel tricks are pretty low.

In this sense, smart vectorised CPU-side code will do much better than a computationally shallow GPU kernel that is dominated by the ( H2D + D2H ) transfer overheads and latencies.
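
A rough back-of-envelope, assuming the 1920x1200 16-bit frame mentioned above and typical effective PCIe bandwidths of a few GB/s: one frame is 1920 * 1200 * 2 B ≈ 4.4 MiB, so each H2D copy alone costs on the order of a millisecond, plus the kernel-launch and D2H read-back latencies, while the kernel performs only about two multiplies and two additions per pixel. That ratio is what "computationally shallow" means in practice.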


Ad 1) Given 2 + 3 and 4 above, 1 may easily lose its point

As prototyped, and with additional cache-friendly vectorisation tricks, the in-RAM / in-cache vectorised code has a fair chance of beating the device code generated by OpenCL's ad-hoc kernel compilation, whether targeting the GPU or a mixed GPU/CPU setup.

answered 2018-03-03T00:05:10.003