cuda - CUDA: Why is Thrust so slow at uploading data to the GPU?

Question

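I'm new to the GPU world and just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU. On my decent desktop, the host-to-device part runs at only about 35MB/s. How can that be?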

Environment: Visual Studio 2012, CUDA 5.0, GTX760, Intel-i7, Windows 7 x64

GPU bandwidth test: [image not available]

The host-to-device transfer speed should be at least 11GB/s, and the same in the other direction! But it isn't!

Here is the test program:

#include <iostream>
#include <ctime>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N (32<<22)

int main(void)
{
    using namespace std;

    cout<<"GPU bandwidth test via thrust, data size: "<< (sizeof(double)*N) / 1000000000.0 <<" Gbytes"<<endl;
    cout<<"============program start=========="<<endl;

    int now = time(0);
    cout<<"Initializing h_vec...";
    thrust::host_vector<double> h_vec(N,0.0f);
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

    now = time(0);
    cout<<"Uploading data to GPU...";
    thrust::device_vector<double> d_vec = h_vec;
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

    now = time(0);
    cout<<"Downloading data to h_vec...";
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl<<endl;

    system("PAUSE");
    return 0;
}

Program output: [image not available]

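Download speed: under 1 second, which is quite reasonable compared with the nominal 11GB/s.
Upload speed: 1.07374GB / 32 seconds is about 33.5MB/s, which makes no sense at all.

Does anyone know the reason? Or is this just the way Thrust works?

Thanks!!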

score 9 · Accepted Answer

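Your comparison has several flaws, some of which are covered in the comments.

You need to eliminate any allocation effects. You can do this by doing some "warm-up" transfers first.
You need to eliminate any "start-up" effects. You can do this by doing some "warm-up" transfers first.
When comparing the data, remember that bandwidthTest uses a PINNED memory allocation, which thrust does not use. The thrust data transfer rate will therefore be slower. This typically contributes about a factor of 2 (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch.
Your choice of timing function may not be the best. cudaEvents are quite reliable for timing CUDA operations.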

Here is a properly timed version of the test:

$ cat t213.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>

#define DSIZE ((1UL<<20)*32)

int main(){

  thrust::device_vector<int> d_data(DSIZE);
  thrust::host_vector<int> h_data(DSIZE);
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  thrust::fill(h_data.begin(), h_data.end(), 1);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());

  std::cout<< "warm up iteration " << d_data[0] << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 2);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  std::cout<< "warm up iteration " << h_data[0] << std::endl;
  thrust::fill(h_data.begin(), h_data.end(), 3);
  cudaEventRecord(start);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 4);
  cudaEventRecord(start);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  std::cout << "finished" << std::endl;
  return 0;
}

I compile it as follows (I have a PCIE Gen2 system with a cc2.0 device):

$ nvcc -O3 -arch=sm_20 -o t213 t213.cu

When I run it, I get the following results:

$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$

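This looks correct to me, because bandwidthTest on my system reports about 6GB/s in either direction, as I have a PCIE Gen2 system. Since thrust uses pageable memory rather than pinned memory, I expect about half that bandwidth, i.e. 3GB/s, and thrust is reporting about 2.5GB/s.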

For comparison, here is bandwidthTest on my system, using pageable memory:

$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1

$
