
I'm trying to learn how to write GPU-optimized OpenCL kernels. As an example I took matrix multiplication using square tiles in local memory. However, in the best case I got only a ~10x speedup (~50 GFLOPS) compared to numpy.dot() (~5 GFLOPS, which uses BLAS).

I found studies where they got speedups of >200x (>1000 GFLOPS), e.g. ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf. I don't know what I'm doing wrong, whether it is just because of my GPU (an NVIDIA GTX 275), or whether it is some pyOpenCL overhead. I also measured how long it takes just to copy the result from the GPU back to RAM, and it is only ~10% of the matrix multiplication time.

#define BLOCK_SIZE 22
__kernel void matrixMul(
      __global float* Cij,
      __global float* Aik,
      __global float* Bkj,
      __const int ni,
      __const int nj,
      __const int nk
){
    // WARNING: interchanging the i and j dimensions lowers performance >2x on my nV GTX 275 GPU
    // (ni, nj, nk are assumed to be multiples of BLOCK_SIZE)
    int gj = get_global_id(0);    int gi = get_global_id(1);   // global element index
    int bj = get_group_id(0);     int bi = get_group_id(1);    // block (work-group) index
    int tj = get_local_id(0);     int ti = get_local_id(1);    // thread index within the block
    int oj = bj*BLOCK_SIZE;       int oi = bi*BLOCK_SIZE;      // block offsets (unused below; gi, gj already include them)
    float Csub = 0;
    __local float As   [BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs   [BLOCK_SIZE][BLOCK_SIZE];
    for (int ok = 0; ok < nk; ok += BLOCK_SIZE )   {
        As[ti][tj] = Aik[ nk*(gi   ) + tj + ok ];   // A[i][k]
        Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ];        // B[k][j]
        barrier(CLK_LOCAL_MEM_FENCE);               // wait until the whole tile is loaded
        for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
        barrier(CLK_LOCAL_MEM_FENCE);               // wait before the tile is overwritten
    }
    Cij[ nj*( gi ) + gj ] = Csub;
}

NOTE - the strange BLOCK_SIZE=22 is the largest BLOCK_SIZE that fits into the max work_group_size, which is 512 on my GPU. In this code the condition BLOCK_SIZE^2 <= max work_group_size must hold, and 22 = int(sqrt(512)). I also tried BLOCK_SIZE=16 and 8, but both were slower than 22.
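For reference, the launch looks roughly like this (a simplified sketch using the OpenCL C++ bindings rather than the pyOpenCL I actually use; the helper name and variables are placeholders, and N is assumed to be a multiple of the block size):

// Hypothetical helper: derive the largest square block from the device's
// work-group limit and launch the tiled kernel. The kernel must have been
// compiled with a matching "#define BLOCK_SIZE".
#include <cmath>
#include <CL/cl.hpp>

void launchMatrixMul(cl::CommandQueue& queue, cl::Kernel& kernel,
                     const cl::Device& device, int N)
{
    size_t maxWG  = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>(); // 512 on my GTX 275
    int blockSize = (int)std::sqrt((double)maxWG);                   // 22, since 22*22 = 484 <= 512

    cl::NDRange globalSize(N, N);                  // one work-item per element of C
    cl::NDRange localSize(blockSize, blockSize);   // blockSize^2 <= max work_group_size
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize);
    queue.finish();
}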

I also tried a simple matrixMul (without using local memory), shown as a sketch below, but it was even 10x slower than numpy.dot(). I copied the code from here: http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html. They say that even the simple version (without local memory) should run 200x faster than the CPU. I don't understand that.
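By the "simple" version I mean essentially this (a from-memory sketch of the usual naive kernel, not a verbatim copy from that blog):

// Naive version: every work-item reads a full row of A and a full column of B
// straight from global memory, with no reuse through local memory.
__kernel void matrixMulNaive(
      __global float* Cij,
      __global const float* Aik,
      __global const float* Bkj,
      const int nj,
      const int nk )
{
    int gj = get_global_id(0);   // column of C
    int gi = get_global_id(1);   // row of C
    float sum = 0.0f;
    for (int k = 0; k < nk; ++k)
        sum += Aik[ nk*gi + k ] * Bkj[ nj*k + gj ];
    Cij[ nj*gi + gj ] = sum;
}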

The dependence of performance on the matrix size N in my case is:

N =  220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464 
N =  330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205 
N =  440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548 
N =  550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217 
N =  660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522 
N =  770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638 
N =  880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152 
N =  990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478 
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625 
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581 
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979  

NOTE - the "Gflops" number is obtained as N^3/time and it does include time required to copy results from GPU to main memory, but this time is just few percent of total time especially for N>1000

Maybe more illustrative are the times in seconds:

N =  220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000 
N =  330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683 
N =  440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565 
N =  550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957 
N =  660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298 
N =  770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638 
N =  880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097 
N =  990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581 
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065 
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581 
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979

I was thinking that maybe some speed improvement could be obtained by using async_work_group_copy or even read_imageui to copy the blocks into local memory. But I don't understand why I see such a big difference when I'm using basically the same code as people who say they get a 200x speedup.
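For example, the load of the As tile inside the k-loop could be replaced by something like this (an untested sketch; oi = bi*BLOCK_SIZE is the block's row offset from the kernel above, and since the tile rows are not contiguous in the row-major global array, each row needs its own copy call):

event_t ev = 0;
for (int r = 0; r < BLOCK_SIZE; ++r)
    ev = async_work_group_copy( &As[r][0],                  // __local destination row
                                &Aik[ nk*(oi + r) + ok ],   // row of the current global tile
                                BLOCK_SIZE, ev );           // chain all rows on one event
wait_group_events(1, &ev);                                  // wait until the whole tile has arrived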


2 Answers


Without even looking at your code, let me make a few comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.

An Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * (4 cores) = 256 SP GFLOPs/s. For DP it's half that: 128 DP GFLOPs/s. Haswell, which comes out in a few weeks, will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700, so the 5 GFLOPs/s you quote for numpy is small and not a fair comparison.

I don't know what the GTX 275 is capable of, but I would guess it's a lot more than 50 GFLOPs/s.

The article you referenced compares against an AMD 7970. They get 848 DP GFLOPs/s (90% efficiency) and 2646 SP GFLOPs/s (70% efficiency). That's close to 10x the CPU performance, not 200x!

Edit: Your FLOP count is wrong; it should be 2.0*n^3. That's still approximate, but it's asymptotically correct. Let me explain.

Consider a 3D dot product: x1*x2 + y1*y2 + z1*z2. That's 3 multiplications and 2 additions, so an n-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to n*n dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPs. So you should double all your GFLOPs/s numbers.

Edit: You may also want to consider the kernel time alone. It's been a while since I used OpenCL, but using the C++ bindings I did something like this:

// create the queue with profiling enabled
queue = cl::CommandQueue(context, devices[device],
                         CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// other code...run the kernel, passing a cl::Event ("clevent") to enqueueNDRangeKernel

// the profiling timestamps are in nanoseconds
cl_ulong time_end   = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();
cl_ulong time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
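Continuing that snippet, the kernel-only time and an effective GFLOPs/s figure (using the corrected 2*n^3 operation count) would then be roughly:

double kernel_seconds = (time_end - time_start) * 1e-9;          // profiling info is in nanoseconds
double gflops = 2.0 * (double)n * n * n / kernel_seconds * 1e-9;  // n = matrix dimension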
answered 2013-05-25T14:25:11.897

A good GPU matrix multiplication does not just use local memory; it stores blocks of A, B, and/or C in registers (which leads to higher register usage and lower occupancy, but is much faster in the end). This is because GPUs have more register space than local memory (128-256 KB of registers vs. 48 KB of local memory on NVIDIA), and registers provide as much bandwidth as the ALUs can handle.
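A minimal sketch of the idea (illustrative only, not tuned for any particular GPU): each work-item keeps WPT elements of C in private registers, so every value read from local memory or held in a register is reused several times. TS and WPT are tuning parameters, matrix sizes are assumed to be multiples of TS, and the launch uses global = (nj, ni/WPT), local = (TS, TS/WPT):

#define TS  16   // tile size in the i and j dimensions
#define WPT 4    // work per thread: C elements kept in registers per work-item
__kernel void matrixMulReg(
      __global float* Cij,
      __global const float* Aik,
      __global const float* Bkj,
      const int ni, const int nj, const int nk )
{
    const int tj  = get_local_id(0);              // 0 .. TS-1
    const int ti  = get_local_id(1);              // 0 .. TS/WPT-1
    const int gj  = get_group_id(0)*TS + tj;      // column of C
    const int gi0 = get_group_id(1)*TS + ti;      // first of the WPT rows this item computes

    __local float As[TS][TS];
    __local float Bs[TS][TS];

    float acc[WPT];                               // C micro-tile held in registers
    for (int w = 0; w < WPT; ++w) acc[w] = 0.0f;

    for (int ok = 0; ok < nk; ok += TS) {
        // each work-item loads WPT elements of A and of B into local memory
        for (int w = 0; w < WPT; ++w) {
            const int row = ti + w*(TS/WPT);
            As[row][tj] = Aik[ nk*(get_group_id(1)*TS + row) + ok + tj ];
            Bs[row][tj] = Bkj[ nj*(ok + row) + gj ];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < TS; ++k) {
            const float b = Bs[k][tj];            // kept in a register, reused WPT times
            for (int w = 0; w < WPT; ++w)
                acc[w] += As[ti + w*(TS/WPT)][k] * b;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    for (int w = 0; w < WPT; ++w)
        Cij[ nj*(gi0 + w*(TS/WPT)) + gj ] = acc[w];
}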

answered 2013-10-07T14:34:22.003