A wave simulator I've been working on with C# + Cudafy (a C# -> CUDA or OpenCL translator) works great, except that the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina, 2.7 GHz i7, GeForce 650M (Kepler, 384 cores)) runs roughly four times as fast as the GPU version.

(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)

To clarify, for a sample problem:

  • OpenCL CPU 1200 Hz
  • OpenCL GPU 320 Hz
  • CUDA GPU ~330 Hz

I'm at a loss to explain why the CPU version would be faster than the GPU. In this case, the kernel code that's executing (in the CL case) on the CPU and GPU is identical. I select either the CPU or GPU device during initialization, but beyond that, everything is identical.

Edit

Here's the C# code that launches one of the kernels. (The others are very similar.)

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        var blockSize = new dim3(1);
        var gridSize = new dim3(_gpuEz.Field.GetLength(0),_gpuEz.Field.GetLength(1));

        Gpu.Launch(gridSize, blockSize)
            .CudaUpdateEz(
                Time
                , ca
                , cb
                , source.Position.X
                , source.Position.Y
                , source.Value
                , _gpuHx.Field
                , _gpuHy.Field
                , _gpuEz.Field
            );

    }

And, here's the relevant CUDA kernel function generated by Cudafy:

extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue,  float* hx, int hxLen0, int hxLen1,  float* hy, int hyLen0, int hyLen1,  float* ez, int ezLen0, int ezLen1)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
    {
        ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
    }
    if (x == sourceX && y == sourceY)
    {
        ez[(x) * ezLen1 + ( y)] += sourceValue;
    }
}

Just for completeness, here's the C# that is used to generate the CUDA:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread
        , float time
        , float ca
        , float cb
        , int sourceX
        , int sourceY
        , float sourceValue
        , float[,] hx
        , float[,] hy
        , float[,] ez
        )
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                +
                cb * (hy[i, j] - hy[i - 1, j])
                -
                cb * (hx[i, j] - hx[i, j - 1])
                ;

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

Obviously, the if statement in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.

The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme - i.e., the grid is the size of the array to be updated, and each block is one thread. I'm sure that has some impact on performance, but I can't see it making the GPU run at a quarter of the speed of the CL code on the CPU. ARGH!

1 Answer

The posted code indicates that the kernel launches are specifying threadblocks of 1 (active) thread. This is not the way to write fast GPU code, because it leaves most of the GPU's capability idle.

Typical threadblock sizes should be at least 128 threads per block, and higher is frequently better, in multiples of 32, up to a limit of 512 or 1024 per block, depending on the GPU.
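For illustration only - an untested sketch that reuses the Gpu.Launch/dim3 calls and field names already shown in the question - the launch in UpdateEz could specify a 16x16 block (256 threads, a multiple of 32) and compute the grid by ceiling division so the whole field is still covered:

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        int width  = _gpuEz.Field.GetLength(0);
        int height = _gpuEz.Field.GetLength(1);

        // 16x16 = 256 threads per block (a multiple of 32).
        var blockSize = new dim3(16, 16);

        // Round the grid up so every cell of the field is covered.
        var gridSize = new dim3((width + 15) / 16, (height + 15) / 16);

        Gpu.Launch(gridSize, blockSize)
            .CudaUpdateEz(
                Time, ca, cb,
                source.Position.X, source.Position.Y, source.Value,
                _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);
    }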

GPUs "like" to hide latency by having a lot of parallel work "available" to them. Specifying more threads per block helps with this goal. (Having a reasonably large number of threadblocks in the grid may also help.)

Furthermore, the GPU executes threads in groups of 32 (warps). Specifying only 1 thread per block, or a non-multiple of 32, will leave some idle execution slots in every threadblock that gets executed. 1 thread per block is particularly bad.
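The kernel then has to derive a global cell index from both the block and thread indices. Here is a minimal sketch of the [Cudafy] method, assuming the 16x16 launch sketched above (GThread exposes blockDim and threadIdx alongside blockIdx, mirroring the CUDA built-ins):

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread, float time, float ca, float cb,
        int sourceX, int sourceY, float sourceValue,
        float[,] hx, float[,] hy, float[,] ez)
    {
        // One thread per cell: global index = block offset + thread offset.
        var i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        var j = thread.blockIdx.y * thread.blockDim.y + thread.threadIdx.y;

        // The existing bounds check also discards the extra threads
        // created by rounding the grid size up.
        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] = ca * ez[i, j]
                + cb * (hy[i, j] - hy[i - 1, j])
                - cb * (hx[i, j] - hx[i, j - 1]);

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }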

answered 2014-06-27T23:01:58.363