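I want to split a block of data on the host and copy the pieces into the device memory of several GPUs, with all of the copies running concurrently.
To do this, I am using cudaMemcpyAsync launched in a private stream for each GPU.
Here is what I am doing (my doubts are marked in the code with comments starting with ??):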
#define SIZE 1000
#define GPUCOUNT 2
int* hostData = nullptr;
int *devData[GPUCOUNT];
cudaStream_t stream[GPUCOUNT];
// Create one stream per GPU
for ( int i=0; i != GPUCOUNT ; ++i )
{
    // ?? Do I need to call cudaSetDevice before creating the stream for each GPU ??
    cudaStreamCreate( &stream[i] );
}
// Allocate pinned data on host (the size argument is in bytes)
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );
// Allocate data on each device and copy part of host data to it
for( int i=0; i != GPUCOUNT ; ++i )
{
    cudaSetDevice( i ); // cudaMalloc allocates on the current device
    cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT ); // ?? Does the blocking behavior of cudaMalloc prevent async memcpys issued in the streams of other GPUs from running concurrently ??
    cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}
// Some CPU code while copy is happening
// ....
// Wait for copy on all streams to finish
for( int i=0; i != GPUCOUNT ; ++i )
{
    cudaSetDevice( i );
    cudaDeviceSynchronize(); // waits for all work on the current device
}
// Do something else
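The CUDA calls in the snippet above take sizes in bytes, whereas the pointer arithmetic on hostData counts elements, so the two are easy to mix up. The following host-only C++ sketch shows one possible way to compute each device's slice. The `Chunk` struct and the `chunkFor` function are hypothetical names introduced for illustration, and giving the remainder to the lowest-numbered devices is an assumption, not something the snippet above does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: describes the slice of the host array for one device.
struct Chunk {
    std::size_t offsetElems; // index of the first element in the host array
    std::size_t countElems;  // number of elements in this slice
    std::size_t countBytes;  // byte count to pass to an allocation or copy call
};

// Splits totalElems elements of elemSize bytes each across deviceCount devices.
// Assumption: the first (totalElems % deviceCount) devices get one extra element.
inline Chunk chunkFor(std::size_t totalElems, std::size_t elemSize,
                      std::size_t deviceCount, std::size_t device) {
    assert(deviceCount > 0 && device < deviceCount);
    const std::size_t base = totalElems / deviceCount;
    const std::size_t extra = totalElems % deviceCount;
    const std::size_t count = base + (device < extra ? 1 : 0);
    // Devices before this one hold `base` elements each, plus one extra
    // element for every earlier device that received part of the remainder.
    const std::size_t offset = device * base + (device < extra ? device : extra);
    return Chunk{offset, count, count * elemSize};
}
```

With SIZE = 1000 and GPUCOUNT = 2, `chunkFor(SIZE, sizeof(int), GPUCOUNT, i)` gives 500 elements per device. `countBytes` is the value the cudaMalloc and cudaMemcpyAsync calls expect, and `offsetElems` is what would be added to hostData.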
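Reading the CUDA C Programming Guide, I understand that the memory copies above will not run asynchronously, because between two consecutive asynchronous copy launches I am calling a host operation that allocates device memory (a blocking call). The guide's section on implicit synchronization says that two commands from different streams cannot run concurrently if the host thread issues any one of the following operations between them:
‣ a page-locked host memory allocation,
‣ a device memory allocation,
‣ a device memory set,
‣ a memory copy between two addresses to the same device memory,
‣ any CUDA command to the default stream.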
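To make the concern concrete, here is a toy host-only C++ model of the order in which operations are issued. It does not call CUDA and says nothing about what the driver actually does. It only checks whether a device allocation sits between two asynchronous copies in the issue order, which is the situation the list above describes. The `Op` enum and all three function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical labels for host-issued operations; these are not CUDA API names.
enum class Op { SetDevice, DeviceAlloc, AsyncCopy };

// Returns true if some device allocation is issued after one async copy
// and before a later async copy.
inline bool allocBetweenCopies(const std::vector<Op>& issued) {
    bool seenCopy = false;       // at least one copy has been issued
    bool allocAfterCopy = false; // an allocation followed an earlier copy
    for (Op op : issued) {
        if (op == Op::AsyncCopy) {
            if (allocAfterCopy) return true;
            seenCopy = true;
        } else if (op == Op::DeviceAlloc && seenCopy) {
            allocAfterCopy = true;
        }
    }
    return false;
}

// Issue order when each loop iteration allocates and then copies.
inline std::vector<Op> interleavedOrder(std::size_t deviceCount) {
    std::vector<Op> ops;
    for (std::size_t i = 0; i < deviceCount; ++i) {
        ops.push_back(Op::SetDevice);
        ops.push_back(Op::DeviceAlloc);
        ops.push_back(Op::AsyncCopy);
    }
    return ops;
}

// Issue order when all allocations are done before any copy is launched.
inline std::vector<Op> decoupledOrder(std::size_t deviceCount) {
    std::vector<Op> ops;
    for (std::size_t i = 0; i < deviceCount; ++i) {
        ops.push_back(Op::SetDevice);
        ops.push_back(Op::DeviceAlloc);
    }
    for (std::size_t i = 0; i < deviceCount; ++i) {
        ops.push_back(Op::AsyncCopy);
    }
    return ops;
}
```

For two devices, `allocBetweenCopies(interleavedOrder(2))` is true, while `allocBetweenCopies(decoupledOrder(2))` is false, because every allocation in that ordering precedes the first copy.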
If the interleaved allocation is the reason, I would need to decouple the memory allocation from the copy operation:
// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
    cudaSetDevice( i ); // cudaMalloc allocates on the current device
    cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT );
}
// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
    // ?? Do I need to call cudaSetDevice before the memory copy ??
    // CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."
    cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}
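Also, is it not possible to do this without creating explicit per-GPU streams, by launching the cudaMemcpyAsync operations in the default stream (stream id 0) of each GPU? I am basing this on the CUDA C Programming Guide's statement that each device has its own default stream, so commands issued to the default stream of one device may execute out of order or concurrently with respect to commands issued to the default stream of any other device. The code would look like this: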
#define SIZE 1000
#define GPUCOUNT 2
int* hostData = nullptr;
int *devData[GPUCOUNT];
// Allocate pinned data on host (the size argument is in bytes)
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );
// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
    cudaSetDevice( i ); // cudaMalloc allocates on the current device
    cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT );
}
// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
    // ?? Do I need to call cudaSetDevice before the memory copy ??
    // CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."
    cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, 0 );
}
// Some CPU code while copy is happening
// ....
// Wait for copy on all devices to finish
for( int i=0; i != GPUCOUNT ; ++i )
{
    cudaSetDevice( i );
    cudaDeviceSynchronize(); // waits for all work on the current device
}
// Do something else