
I am a CUDA beginner reading some Thrust tutorials. I wrote a simple but terribly organized piece of code, trying to figure out the speedup Thrust gives (is this idea even correct?). I try to add two vectors (each with 10000000 ints) into another vector, adding plain arrays on the CPU and device_vectors on the GPU.

Here is the thing:

#include <iostream>
#include <ctime>     // clock(), CLOCKS_PER_SEC
#include <cstdio>    // getchar()
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N 10000000
int main(void)
{
    float time_cpu;
    float time_gpu;
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];
    for(int i=0;i<N;i++)   // initialize the input arrays on the host
    {
        a[i]=i;
        b[i]=i*i;
    }
    clock_t start_cpu,stop_cpu;
    start_cpu=clock();
    for(int i=0;i<N;i++)   // CPU array adding
    {
        c[i]=a[i]+b[i];
    }
    stop_cpu=clock();
    time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;   // convert seconds to ms
    std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;
    thrust::device_vector<int> X(N);
    thrust::device_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for(int i=0;i<N;i++)   // initialize the device vectors element by element
    {
        X[i]=i;
        Y[i]=i*i;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);       
    thrust::transform(X.begin(), X.end(),
        Y.begin(),
        Z.begin(),
        thrust::plus<int>());
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime,start,stop);
    std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop); 
    getchar();
    return 0;
}

The CPU result looks really fast, but the GPU runs really slowly on my machine (i5-2320, 4G RAM, GTX 560 Ti): the CPU time is about 26 and the GPU time is about 30! Did I just make a stupid mistake in my code, or is there a deeper reason?

As a C++ noob, I checked my code over and over again, and Thrust on the GPU was still slow, so I did some experiments to show the difference between computing vectorAdd with five different approaches. I use the Windows API QueryPerformanceFrequency()/QueryPerformanceCounter() as a unified timing method.

Each experiment looks like this:

LARGE_INTEGER large_interger;
double f, c1, c2;
QueryPerformanceFrequency(&large_interger);   // ticks per second
f = large_interger.QuadPart;
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;

for(int j=0;j<10;j++)
{
    for(int i=0;i<N;i++)//CPU array adding
    {
        c[i]=a[i]+b[i];
    }
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);

This is my simple __global__ function for GPU array adding:

__global__ void add(int *a, int *b, int *c)
{
    // grid-stride loop: each thread strides through the array
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    while(tid<N)
    {
        c[tid]=a[tid]+b[tid];
        tid+=blockDim.x*gridDim.x;
    }
}

and the function is called like this:

for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);//GPU array adding
}   

I add vectors a[N] and b[N] into vector c[N], looping 10 times, by:

  1. array adding on the CPU
  2. std::vector adding on the CPU
  3. thrust::host_vector adding on the CPU
  4. thrust::device_vector adding on the GPU
  5. array adding on the GPU (as shown above)

With N=10000000, I got these results:

  1. CPU array adding: 268.992968 ms
  2. CPU std::vector adding: 1908.013595 ms
  3. CPU thrust::host_vector adding: 10776.456803 ms
  4. GPU thrust::device_vector adding: 297.156610 ms
  5. GPU array adding: 5.210573 ms

This confuses me. I am not familiar with how template libraries are implemented. Is the performance gap between containers and raw data structures really that big?


3 Answers


Most of the execution time is being spent in your loop that is initializing X[i] and Y[i]. While this is legal, it's a very slow way to initialize large device vectors: each element assignment to a device_vector from host code turns into a separate host-to-device transfer. It would be better to create host vectors, initialize them, then copy those to the device in one bulk transfer. As a test, modify your code like this (right after the loop where you are initializing the device vectors X[i] and Y[i]):

}  // this is your line of code
std::cout<< "Starting GPU run" <<std::endl;  //add this line
cudaEvent_t start, stop;   //this is your line of code

You will then see that the GPU timing results appear almost immediately after that added line prints out. So all of the time you're waiting is spent in initializing those device vectors directly from host code.

When I run this on my laptop, I get a CPU time of about 40 and a GPU time of about 5, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.

If you create X and Y as host vectors, and then create analogous d_X and d_Y device vectors, the overall execution time will be shorter, like so:

thrust::host_vector<int> X(N);     
thrust::host_vector<int> Y(N);     
thrust::device_vector<int> Z(N);     
for(int i=0;i<N;i++)     
{     
    X[i]=i;     
    Y[i]=i*i;     
}   
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;

and change your transform call to:

thrust::transform(d_X.begin(), d_X.end(),      
    d_Y.begin(),      
    Z.begin(),      
    thrust::plus<int>()); 

OK, so you've now indicated that the CPU run measurement is faster than the GPU measurement. Sorry, I jumped to conclusions. My laptop is an HP laptop with a 2.6GHz Core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments: if you're running any heavy display tasks on your GPU, that can detract from performance. Also, when benchmarking these things it's common practice to use the same mechanism for comparison; you can use cudaEvents for both if you want, since it can time CPU code the same as GPU code. It's also common practice with Thrust to do an untimed warm-up run and then repeat the test for the measurement, and likewise to run the test 10 times or more in a loop and divide to get an average. In my case, I can tell the clock() measurement is pretty coarse, because successive runs give me 30, 40 or 50, whereas on the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
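
As an illustration, here is a minimal sketch of that warm-up-plus-average pattern applied to the transform call above (the run count, event names, and printed label are my own choices):

// untimed warm-up run absorbs one-time CUDA/Thrust setup cost
thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());

const int runs = 10;
cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);
cudaEventRecord(t0,0);
for(int r=0;r<runs;r++)
{
    thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
}
cudaEventRecord(t1,0);
cudaEventSynchronize(t1);
float total_ms;
cudaEventElapsedTime(&total_ms,t0,t1);
std::cout<<"average transform time: "<<total_ms/runs<<" ms"<<std::endl;
cudaEventDestroy(t0);
cudaEventDestroy(t1);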

OK, I did another experiment. The compiler makes a big difference on the CPU side. I compiled with the -O3 switch and the CPU time dropped to 0. Then I converted the CPU timing measurement from the clock() method to cudaEvents, and I got a CPU measured time of 12.4 (with -O3 optimization) and still 5.1 on the GPU side.
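
Timing the CPU loop with cudaEvents just means recording the events around host code; a minimal sketch, reusing the a/b/c arrays from the question (the event names are my own):

cudaEvent_t c_start, c_stop;
cudaEventCreate(&c_start);
cudaEventCreate(&c_stop);
cudaEventRecord(c_start,0);
for(int i=0;i<N;i++)   // the host loop being timed
{
    c[i]=a[i]+b[i];
}
cudaEventRecord(c_stop,0);
cudaEventSynchronize(c_stop);
float cpu_ms;
cudaEventElapsedTime(&cpu_ms,c_start,c_stop);
std::cout<<"Time to generate (CPU, cudaEvent): "<<cpu_ms<<" ms"<<std::endl;
cudaEventDestroy(c_start);
cudaEventDestroy(c_stop);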

Your mileage will vary based on timing method and which compiler you are using on the CPU side.

Answered 2012-09-27T15:25:48.467

First, Y[i]=i*i; does not fit in an int for 10M elements: a 32-bit int holds values only up to about 2.1e9, while your code needs values up to about 1e14.
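
As a sketch of one way around the overflow (my own illustration, not from the answer): widen the element type to 64 bits, which comfortably holds 1e14, and do the bulk host-to-device copy that Robert recommends.

thrust::host_vector<long long> hX(N), hY(N);
for(int i=0;i<N;i++)
{
    hX[i] = i;
    hY[i] = (long long)i * i;   // 64-bit multiply avoids the 32-bit overflow
}
thrust::device_vector<long long> dX = hX;   // single bulk copy to the device
thrust::device_vector<long long> dY = hY;
thrust::device_vector<long long> dZ(N);
thrust::transform(dX.begin(), dX.end(), dY.begin(), dZ.begin(),
                  thrust::plus<long long>());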

Second, the timing of your transform looks correct and it should run faster than the CPU, regardless of which library you use. Robert's suggestion to initialize the vectors on the CPU and then transfer them to the GPU is the right one for this case.

Third, since we can't do the integer multiply here, below is some simpler code using a CUDA library (ArrayFire, which I work on) that does something similar with floats, for your benchmarking:

#include <arrayfire.h>
using namespace af;

int n = 10e6;
array x = array(seq(n));
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z); af::sync();
printf("elapsed seconds: %g\n", timer::toc(t));

Good luck!

Answered 2012-09-27T15:39:24.510

I recently ran a similar test with CUDA Thrust on my Quadro 1000M. I used thrust::sort_by_key as a benchmark to test its performance, and the result was too good to convince my boss: sorting 512MB of pairs takes 100+ ms.
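
For reference, a minimal sketch of that kind of sort_by_key benchmark; the pair count, key contents, and names are my own assumptions (64M int/int pairs is 512MB of key/value data):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>

const int M = 64 * 1024 * 1024;   // 64M pairs of (int key, int value) = 512MB
thrust::device_vector<int> keys(M), vals(M);
thrust::sequence(keys.begin(), keys.end(), M, -1);   // descending keys so there is real work to do
thrust::sequence(vals.begin(), vals.end());
cudaEvent_t s0, s1;
cudaEventCreate(&s0);
cudaEventCreate(&s1);
cudaEventRecord(s0,0);
thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
cudaEventRecord(s1,0);
cudaEventSynchronize(s1);
float sort_ms;
cudaEventElapsedTime(&sort_ms,s0,s1);
printf("sort_by_key: %f ms\n", sort_ms);
cudaEventDestroy(s0);
cudaEventDestroy(s1);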

Regarding your question, I am confused by two things.

(1) Why do you multiply time_cpu by 1000? Without the 1000, it is already in seconds.

time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;

(2) And when you mention 26, 30, 40, do you mean seconds or milliseconds? 'cudaEvent' reports elapsed time in 'ms', not 's'.

Answered 2013-04-20T17:36:24.320