I want to split up a set of data on the host and copy the pieces to the device memory of multiple GPUs, and I want all of these copies to happen concurrently.
For this I use cudaMemcpyAsync, launched in a private stream for each GPU.
Here is what I am doing (my doubts in the code are marked with comments containing ??):
#define SIZE 1000
#define GPUCOUNT 2
int* hostData = nullptr;
int *devData[GPUCOUNT];
cudaStream_t stream[GPUCOUNT];
// Create one stream per GPU
for ( int i=0; i != GPUCOUNT ; ++i )
{
// ?? Do I need to call cudaSetDevice before creating the stream for each GPU ??
cudaStreamCreate(&stream[i]);
}
// Allocate pinned data on host
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );
// Allocate data on each device and copy part of host data to it
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT ); // ?? Does the blocking behavior of cudaMalloc prevent the async memcpy issued in the stream of another GPU from running concurrently ??
cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}
// Some CPU code while copy is happening
// ....
// Wait for the copies on all devices to finish
// (cudaDeviceSynchronize only synchronizes the current device, so loop over all of them)
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaDeviceSynchronize();
}
// Do something else
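To see whether the copies actually overlap, I am also thinking of timing each copy with events recorded in its own stream, roughly along these lines (just a sketch; the start/stop event arrays are mine, and I create each event with its device current since cudaEventRecord fails if the event and stream belong to different devices):

// Sketch: per-copy timing with events recorded in each GPU's stream
cudaEvent_t start[GPUCOUNT], stop[GPUCOUNT];
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaEventCreate(&start[i]);
cudaEventCreate(&stop[i]);
cudaEventRecord(start[i], stream[i]);
cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
cudaEventRecord(stop[i], stream[i]);
}
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaEventSynchronize(stop[i]);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start[i], stop[i]);
printf("copy on GPU %d took %f ms\n", i, ms);
}

My idea is that if the per-copy times add up to roughly the total wall time, the copies were serialized, while a total close to the longest single copy would mean they overlapped.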
Reading the C Programming Guide, I see that the memory copies above will not happen asynchronously, because between two consecutive async-copy launches I issue a host operation that allocates device memory (a blocking call):
3.2.5.5.4. Implicit Synchronization
Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
‣ a page-locked host memory allocation,
‣ a device memory allocation,
‣ a device memory set,
‣ a memory copy between two addresses to the same device memory,
‣ any CUDA command to the default stream,
If the above reasoning holds, then I need to split my allocation and copy operations:
// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT );
}
// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
// ?? Do I need to call cudaSetDevice before the memory copy ??
// The CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."
cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, stream[i] );
}
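With allocation and copies split like this, I assume I could also wait on just these copies rather than doing a full device synchronization, along these lines (a sketch; I set the device first to be on the safe side, since the streams were created on different devices):

// Sketch: wait on each copy's stream instead of synchronizing whole devices
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaStreamSynchronize(stream[i]); // returns once the copy queued in this stream has completed
}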
Is my analysis above valid?
Also, couldn't this be done without creating an explicit per-GPU stream at all, by launching the cudaMemcpyAsync operations in the default stream (stream id 0) of each GPU? I am basing this on the following statement from the CUDA C Programming Guide:
Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
The code would then look like this:
#define SIZE 1000
#define GPUCOUNT 2
int* hostData = nullptr;
int *devData[GPUCOUNT];
// Allocate pinned data on host
cudaMallocHost( (void**) &hostData, sizeof(int) * SIZE );
// Allocate data on each device
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaMalloc( (void**) &devData[i], sizeof(int) * SIZE/GPUCOUNT );
}
// Copy part of host data to each device
for( int i=0; i != GPUCOUNT ; ++i )
{
// ?? Do I need to call cudaSetDevice before the memory copy ??
// The CUDA guide says: "A memory copy will succeed even if it is issued to a stream that is not associated to the current device."
cudaMemcpyAsync( (void*) devData[i], hostData + i*SIZE/GPUCOUNT, sizeof(int) * SIZE/GPUCOUNT, cudaMemcpyHostToDevice, 0 );
}
// Some CPU code while copy is happening
// ....
// Wait for the copies on all devices to finish
// (cudaDeviceSynchronize only synchronizes the current device, so loop over all of them)
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaDeviceSynchronize();
}
// Do something else
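For completeness, my cleanup at the end would be something like this (sketch; the default-stream variant has no streams to destroy, and in the explicit-stream variant I would additionally call cudaStreamDestroy(stream[i]) inside the loop):

// Sketch: release the per-device buffers, then the pinned host buffer
for( int i=0; i != GPUCOUNT ; ++i )
{
cudaSetDevice(i);
cudaFree(devData[i]);
}
cudaFreeHost(hostData);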