A wave simulator I've been working on with C# and Cudafy (a C# to CUDA or OpenCL translator) works great, except that the OpenCL CPU version runs roughly four times as fast as the GPU version. The machine is a 15" MacBook Pro Retina with a 2.7GHz i7 and a GeForce 650M (Kepler, 384 cores); the CPU version uses the Intel OpenCL driver.
(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)
To clarify, for a sample problem:
- OpenCL CPU 1200 Hz
- OpenCL GPU 320 Hz
- CUDA GPU ~330 Hz
I'm at a loss to explain why the CPU version would be faster than the GPU. In this case, the kernel code that's executing (in the CL case) on the CPU and GPU is identical. I select either the CPU or GPU device during initialization, but beyond that, everything is identical.
Edit
Here's the C# code that launches one of the kernels. (The others are very similar.)
public override void UpdateEz(Source source, float Time, float ca, float cb)
{
    var blockSize = new dim3(1);
    var gridSize = new dim3(_gpuEz.Field.GetLength(0), _gpuEz.Field.GetLength(1));
    Gpu.Launch(gridSize, blockSize)
        .CudaUpdateEz(
            Time
            , ca
            , cb
            , source.Position.X
            , source.Position.Y
            , source.Value
            , _gpuHx.Field
            , _gpuHy.Field
            , _gpuEz.Field
        );
}
And, here's the relevant CUDA kernel function generated by Cudafy:
extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue, float* hx, int hxLen0, int hxLen1, float* hy, int hyLen0, int hyLen1, float* ez, int ezLen0, int ezLen1)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
    {
        ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
    }
    if (x == sourceX && y == sourceY)
    {
        ez[(x) * ezLen1 + ( y)] += sourceValue;
    }
}
Just for completeness, here's the C# that is used to generate the CUDA:
[Cudafy]
public static void CudaUpdateEz(
    GThread thread
    , float time
    , float ca
    , float cb
    , int sourceX
    , int sourceY
    , float sourceValue
    , float[,] hx
    , float[,] hy
    , float[,] ez
    )
{
    var i = thread.blockIdx.x;
    var j = thread.blockIdx.y;
    if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        ez[i, j] =
            ca * ez[i, j]
            +
            cb * (hy[i, j] - hy[i - 1, j])
            -
            cb * (hx[i, j] - hx[i, j - 1])
            ;
    if (i == sourceX && j == sourceY)
        ez[i, j] += sourceValue;
}
Obviously, the if statement in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.
The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme, i.e., the grid is the size of the array to be updated, and each block is a single thread. I'm sure this has some impact on performance, but I can't see it making the GPU run at a quarter of the speed of the CL code running on the CPU. ARGH!
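For comparison, here is a minimal sketch of what a more conventional launch might look like, with many threads per block instead of one. NVIDIA GPUs schedule threads in warps of 32, so a one-thread block leaves 31 of every 32 lanes idle, which could plausibly account for much of the gap. The kernel name `UpdateEzTiled`, the helper `CeilDiv`, the 32x8 block shape, and the 256x256 test grid are all assumptions made for illustration; none of this comes from Cudafy, and it has not been timed on this hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of CudaUpdateEz in which each block holds many threads.
// threadIdx.x is mapped to y, the fastest-varying array index, so neighbouring
// threads in a warp read and write neighbouring floats.
__global__ void UpdateEzTiled(float ca, float cb,
                              int sourceX, int sourceY, float sourceValue,
                              const float* hx, int hxLen1,
                              const float* hy, int hyLen1,
                              float* ez, int ezLen0, int ezLen1)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int x = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up to whole blocks, so some threads fall outside the array.
    if (x >= ezLen0 || y >= ezLen1) return;

    if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
    {
        ez[x * ezLen1 + y] = ca * ez[x * ezLen1 + y]
            + cb * (hy[x * hyLen1 + y] - hy[(x - 1) * hyLen1 + y])
            - cb * (hx[x * hxLen1 + y] - hx[x * hxLen1 + (y - 1)]);
    }
    if (x == sourceX && y == sourceY)
    {
        ez[x * ezLen1 + y] += sourceValue;
    }
}

// Hypothetical helper: number of blocks needed to cover `count` elements.
static int CeilDiv(int count, int blockDim)
{
    return (count + blockDim - 1) / blockDim;
}

int main()
{
    const int n = 256;                       // assumed square test grid
    const size_t bytes = sizeof(float) * n * n;

    float *hx, *hy, *ez;
    cudaMalloc(&hx, bytes);
    cudaMalloc(&hy, bytes);
    cudaMalloc(&ez, bytes);
    cudaMemset(hx, 0, bytes);
    cudaMemset(hy, 0, bytes);
    cudaMemset(ez, 0, bytes);

    // 32x8 = 256 threads per block: an arbitrary starting point, not a tuned value.
    dim3 block(32, 8);
    // grid.x covers the y (column) extent, grid.y covers the x (row) extent.
    dim3 grid(CeilDiv(n, block.x), CeilDiv(n, block.y));

    UpdateEzTiled<<<grid, block>>>(1.0f, 0.5f, n / 2, n / 2, 1.0f,
                                   hx, n, hy, n, ez, n, n);
    cudaDeviceSynchronize();

    // Read back the single cell where the source was injected.
    float atSource = 0.0f;
    cudaMemcpy(&atSource, ez + (n / 2) * n + n / 2, sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("ez at source = %f\n", atSource);

    cudaFree(hx);
    cudaFree(hy);
    cudaFree(ez);
    return 0;
}
```

On the Cudafy side, the equivalent change would presumably be to pass a larger `blockSize` to `Gpu.Launch`, shrink `gridSize` accordingly, and compute the indices from `blockIdx`, `blockDim` and `threadIdx` in the C# kernel; I haven't verified how much of the gap this closes.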