c++ - Performance of adding 2D arrays on the GPU and CPU

Question

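I am currently comparing the performance of OpenCL code on a GPU against C++ on a CPU. I have written programs for both that compute the sum z = x + y, where z, x, and y are two-dimensional arrays (matrices). After testing these programs, I found that the CPU computes this sum more efficiently than the GPU, because data transfer over the PCI bus between the GPU and the CPU is slow. Now I want to determine how many more sums are needed before the GPU becomes more efficient than the CPU. I plan to do this by extending the sum z = x + y to z = x + y + y + y + y + ... and so on.

Is it possible to make the GPU more efficient than the CPU for this specific problem just by increasing the number of sums?

FYI: I am using an nVIDIA GeForce GT 640 graphics card and an Intel Core i5 CPU.

Any help would be appreciated.

Edit:

Below is my code for the CPU: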

#include <cstdlib> //Needed for malloc

int main(int argc, const char * argv[])
{

    //This value determines the size of the nxn (square array)             
    int n = 1000;

    //Allocating the memory for the nxn arrays of floats.
    float **x = (float**)malloc(sizeof(float*)*n);
    float **y = (float**)malloc(sizeof(float*)*n);
    float **z = (float**)malloc(sizeof(float*)*n);


    //Initializing the arrays.
    for(int i = 0; i<n; i++){
        x[i] = (float*)malloc(sizeof(float)*n);
        y[i] = (float*)malloc(sizeof(float)*n);
        z[i] = (float*)malloc(sizeof(float)*n);

        for(int j = 0; j<n; j++){
            x[i][j] = i+j;
            y[i][j] = i+j;

        }
    }

    for(int i = 0; i<n; i++){
        for(int j = 0; j<n; j++){

            z[i][j] = x[i][j] + y[i][j];
            for(int k = 0; k < 100; k++){
                z[i][j] += y[i][j];
            }
        }
    }

    return 0;

}

Here is the C++ using OpenCL (it copies the data and executes the kernel on the GPU):

int n = 1000;

for(int i = 0; i<n; i++)
    {
        //Writing the data from the host to the device
        err = clEnqueueWriteBuffer(queue, d_xx, CL_TRUE, 0, sizeof(float)*n, h_xx[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_xx" << std::endl;
            exit(1);
        }

        err = clEnqueueWriteBuffer(queue, d_yy, CL_TRUE, 0, sizeof(float)*n, h_yy[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not write to buffer d_yy" << std::endl;
            exit(1);
        }

        //Setting the Kernel Arguments
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_xx);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument d_xx." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_yy);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument d_yy." << std::endl;
            exit(1);
        }

        err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_zz);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not set kernel argument d_zz." << std::endl;
            exit(1);
        }

        work_units_per_kernel = n;

        //Executing the Kernel
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel, NULL, 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not execute kernel." << std::endl;
            exit(1);
        }

        //Reading the Data from the Kernel
        err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, n*(sizeof(float)), h_zz[i], 0, NULL, NULL);
        if(err != CL_SUCCESS){
            std::cout << "Error: Could not read data from kernel." << std::endl;
            exit(1);
        }

    }

Finally, here is the kernel code executed on the GPU:

__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

    int i = get_global_id(0);

    d_cc[i] = d_aa[i] + d_bb[i];


    for(int j = 0; j < 100; j++){
        d_cc[i] += d_bb[i];
    }


}

score 2 · Accepted Answer

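With n = 1000*1000 you have reached the point where it is worth copying the data over, operating on it, and copying it back. As DarkZero pointed out, global memory is not optimal, so if you can cache global memory in local memory or per-thread (private) memory and use local work groups, that will help a lot on both the CPU and the GPU.

Let's start with the kernel. d_cc is referenced 100 times in global memory. A simple change in this case is to cache the global value in thread memory and then copy the local result back to global memory at the end.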
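On the host side, the copy step tends to be cheaper when each matrix is one contiguous block. The question's code allocates every row separately, so it has to issue one blocking transfer per row. Here is a minimal sketch of a contiguous layout; `FlatMatrix` and `make_index_sum_matrix` are hypothetical names, not part of the original code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): a row-major n x n matrix
// stored in one contiguous allocation instead of n separately malloc'ed rows.
struct FlatMatrix {
    std::size_t n;
    std::vector<float> data;  // element (i, j) lives at data[i * n + j]

    explicit FlatMatrix(std::size_t size) : n(size), data(size * size, 0.0f) {}

    float& at(std::size_t i, std::size_t j) { return data[i * n + j]; }

    // Number of bytes a single whole-matrix transfer would have to move.
    std::size_t bytes() const { return data.size() * sizeof(float); }
};

// Same initialization as the question's CPU code: x[i][j] = i + j.
FlatMatrix make_index_sum_matrix(std::size_t n) {
    FlatMatrix m(n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            m.at(i, j) = static_cast<float>(i + j);
    return m;
}
```

With this layout, `m.data.data()` and `m.bytes()` could be passed to a single `clEnqueueWriteBuffer` call, assuming the device buffers are created with room for n*n floats and the kernel is launched over n*n work-items.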

 __kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{

     int i = get_global_id(0);

     float t_d_cc = d_aa[i] + d_bb[i]; //make a thread only version of d_cc

     for(int j = 0; j < 100; j++){
         t_d_cc += d_bb[i];
     }

     d_cc[i] = t_d_cc; //copy the thread only back to global
}

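Another change, which depends on the hardware, is to cache d_aa and d_bb in local memory. This lets OpenCL take advantage of bulk copies from global memory. It can be more challenging, because every OpenCL device has a different maximum local work group size and a different multiple that the size must respect.

For example, my i5 has a maximum work group size of 1024 and a work group multiple of 1, so my local work group can be any value from 1 to 1024. My ATI 7970 has values of 256 and 64 respectively, so my local work group needs to be 64, 128, and so on. That is much more restrictive.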

 __kernel void arraysum(__global const float *d_aa, 
                        __local float *l_d_aa,
                        __global const float *d_bb,
                        __local float *l_d_bb, 
                        __global float *d_cc,
                        __local float *l_d_cc)
{

//In this example, get_global_id(1) is the row and get_global_id(0) is the column.
//So when the kernel is called, the local work group size needs to be the size of the 
//number of columns

int i = get_global_id(1)*get_global_size(0) + get_global_id(0); //Flattened index of the element
int lid = get_local_id(0); //Position within the work group

//Cache this work group's slice of global memory in local memory
l_d_aa[lid] = d_aa[i];
l_d_bb[lid] = d_bb[i];

//Wait until every work-item in the group has written its local values
barrier(CLK_LOCAL_MEM_FENCE);

l_d_cc[lid] = l_d_aa[lid] + l_d_bb[lid];

//Adds every cached value of the row; the bound equals the number of columns
//when the local size matches it, as assumed above
for(int j = 0; j < (int)get_local_size(0); j++){
    l_d_cc[lid] += l_d_bb[j];
}

d_cc[i] = l_d_cc[lid]; //copy the local result back to global

}

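I apologize if my algorithm is wrong, but hopefully it conveys how to cache global memory in local memory. Again, on the i5 the local work group size can be anything from 1 to 1024, but the ATI 7970 is limited to column counts of 64, 128, and so on.

It is conceptually much harder, but OpenCL performs much better with this approach.

Community, please feel free to clean up the kernel.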
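The size constraint can be handled on the host before the kernel is enqueued. The sketch below is only an illustration: `round_up_to_multiple` and `pick_local_size` are hypothetical helpers, and the limits passed to them are assumed to come from the device (for instance via `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE` and `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`).

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
// Assumes multiple >= 1. `multiple` stands for the device's work group
// multiple (64 on the ATI 7970 above, 1 on the i5).
std::size_t round_up_to_multiple(std::size_t n, std::size_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: choose a local work group size for `columns` columns.
// If the padded column count fits within the device maximum, use it;
// otherwise fall back to the largest allowed multiple below the maximum.
std::size_t pick_local_size(std::size_t columns,
                            std::size_t max_group_size,
                            std::size_t multiple) {
    std::size_t wanted = round_up_to_multiple(columns, multiple);
    if (wanted <= max_group_size) {
        return wanted;
    }
    return (max_group_size / multiple) * multiple;
}
```

For 1000 columns this gives a local size of 1000 with the i5 limits (1024, 1) and 256 with the ATI 7970 limits (256, 64). In the second case the global size would be padded to `round_up_to_multiple(1000, 256)`, which is 1024, and the kernel would then need to ignore the padded work-items beyond the last real column.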

score 2 · Accepted Answer

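Many things are slowing you down:

1- Misuse of global memory. Each global memory access is roughly 400 times slower, and you are using only global memory (about 200 reads/writes). Global memory should be used only for the initial read and the final write, never for intermediate values.

2- Your N is very short. The CPU finishes in only about 1000 instructions, while all the latencies on the GPU are far larger than that. Copy operations also carry a fixed overhead, which is why a 100 MB copy is far more efficient than a 1-byte copy.

3- The compiler may be optimizing the CPU code into a multiplication, whereas it cannot do that for the GPU code because it accesses volatile-like variables such as globals.

4- Memory reads/writes to the device are very expensive; if you include them in the timing, the CPU will easily win. Creating OpenCL buffers and kernels is also very expensive. Note that you are also using blocking write calls, which are much slower than non-blocking calls.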
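To make point 3 concrete, here is a small sketch of the rewrite a compiler might apply: the 101 additions of y collapse into a single multiplication. The function names are hypothetical. For floats, compilers usually perform this kind of reassociation only under relaxed floating-point settings (such as `-ffast-math` in GCC and Clang), since the two forms can round differently in general. For the small integer-valued inputs in the question, both forms are exact and give the same result.

```cpp
// The inner computation from the question: z = x + y, then 100 more
// additions of y, all performed one at a time.
float sum_with_loop(float x, float y) {
    float z = x + y;
    for (int k = 0; k < 100; ++k) {
        z += y;
    }
    return z;
}

// Closed form of the same computation: y is added 101 times in total,
// so the loop can be replaced by one multiply and one add.
float sum_closed_form(float x, float y) {
    return x + 101.0f * y;
}
```

If the CPU build ends up running something like `sum_closed_form` while the GPU kernel still performs 100 global memory updates, the two timings are not measuring the same amount of work.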
