Let's reverse engineer the problem. Understanding the efficient processing of the "dependency-chain" of image[][], image_height, image_width, a, b
Ad 4 ) the tandem of identical for
-loops has a poor performance
given the defined code, there could be just a single loop, thus with reduced overhead costs and best with also maximising cache-aligned vectorised code.
Cache-Naive re-formulation:
int a = 0;
int c = 1;
for ( int y = 0; y < img_height; y++ ){
for ( int x = 0; x < img_width; x++ ){
int intermediate = image[x,y] * y; // .SET PROD(i[x,y],y)
a += x * intermediate; // .REUSE 1st
c -= intermediate; // .REUSE 2nd
}
}
int b = a * c; // was my fault upon being in a hurry leaving for weekend :o)
Moving the code into the split tandem loops is only increasing these overheads and devastating any possible cache-friendly tricks in the code-performance tweaking.
Ad 3 + 2 ) kernel call-signature + CPU-side methods allow this
OpenCL and Cloo document these details, so nothing magical beyond the documented methods is needed here.
Yet, there are latency costs associated with each such host-side to device-side + device-side to host-side transfers. Given you claim that the 16bit-1920x1200 image-data are to be re-processed ~ 10 times in a loop, there are some chances these latencies need not be spent on every such loop pass-through.
The worst performance-killer is a very shallow kernel mathematical density. The problem is, there is indeed not much to calculate in the kernel, so the chances for any efficient SIMD / GPU parallel tricks are indeed pretty low.
In this sense, the CPU-side smart-vectorised code will do much better than the ( H2D + D2H )-overheads-far latency-hostile computationally-shallow GPU-kernel processing.
Ad 1) Given 2+3 and 4 above, 1 may easily loose sense
As prototyped and given additional cache-friendly vectorised tricks, the in-ram + in-cache vectorised code will have chances to beat all OpenCL and mixed-GPU/CPU automated ad-hoc kernel compilation generated device code and it's computing efforts.