cuda - Problem using multiple NVIDIA GPUs

Question

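I am learning how to use multiple GPUs in my CUDA application. I tried a simple program, and it runs successfully on a system with two Tesla C2070s. However, when I run the same program on a different system with a Tesla K40c and a Tesla C2070, it fails with a segmentation fault. What could the problem be? I am sure there is nothing wrong with the code. Is there some setting required in the environment? I have attached my code below for reference.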

#include <stdio.h>
#include "device_launch_parameters.h"
#include "cuda_runtime_api.h"

__global__ void testA(int *a)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   a[i] = a[i] * 2;
}

int main()
{
   int *ai, *bi, *ao, *bo;
   int iter;
   cudaStream_t streamA, streamB;
   cudaSetDevice(0);
   cudaStreamCreate(&streamA);
   cudaMalloc((void**)&ao, 10 * sizeof(int));
   cudaHostAlloc((void**)&ai, 10 * sizeof(int), cudaHostAllocMapped);
   for(iter=0; iter<10; iter++)
   {
       ai[iter] = iter+1;
   }

   cudaSetDevice(1);
   cudaStreamCreate(&streamB);
   cudaMalloc((void**)&bo, 10 * sizeof(int));
   cudaHostAlloc((void**)&bi, 10 * sizeof(int), cudaHostAllocMapped);
   for(iter=0; iter<10; iter++)
   {
       bi[iter] = iter+11;
   }

   cudaSetDevice(0);
   cudaMemcpyAsync(ao, ai, 10 * sizeof(int), cudaMemcpyHostToDevice, streamA);
   testA<<<1, 10, 0, streamA>>>(ao);
   cudaMemcpyAsync(ai, ao, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamA);

   cudaSetDevice(1);
   cudaMemcpyAsync(bo, bi, 10 * sizeof(int), cudaMemcpyHostToDevice, streamB);
   testA<<<1, 10, 0, streamB>>>(bo);
   cudaMemcpyAsync(bi, bo, 10 * sizeof(int), cudaMemcpyDeviceToHost, streamB);

   cudaSetDevice(0);
   cudaStreamSynchronize(streamA);

   cudaSetDevice(1);
   cudaStreamSynchronize(streamB);

   printf("%d %d %d %d %d\n",ai[0],ai[1],ai[2],ai[3],ai[4]);
   printf("%d %d %d %d %d\n",bi[0],bi[1],bi[2],bi[3],bi[4]);
   return 0;
}

The segmentation fault occurs while initializing the bi array inside the for loop, which means no memory was allocated for bi.

score 1 · Accepted Answer

Based on the new information you provided after adding error checking, the problem you are seeing is caused by an ECC error.
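The segfault in the question is a symptom, not the cause: none of the CUDA runtime calls are checked, so when the calls on device 1 fail, bi is never set to valid pinned memory and the host loop writes through a garbage pointer. A minimal sketch of the kind of error checking that exposes the real failure is shown below. The helper cuda_ok and the macro CUDA_CHECK are hypothetical names introduced here, not part of the CUDA API, and only the device-1 half of the original program is kept. Whichever call first fails on the bad GPU is reported; for an uncorrectable ECC error the runtime returns an error such as cudaErrorECCUncorrectable rather than letting the program continue.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if err == cudaSuccess; otherwise prints the
// failing call, its source line and the CUDA error name/string, and returns 0.
static int cuda_ok(cudaError_t err, const char *what, int line)
{
    if (err == cudaSuccess)
        return 1;
    fprintf(stderr, "line %d: %s failed: %s (%s)\n",
            line, what, cudaGetErrorName(err), cudaGetErrorString(err));
    return 0;
}

// Hypothetical macro: abort the program at the first failing runtime call.
#define CUDA_CHECK(call)                                 \
    do {                                                 \
        if (!cuda_ok((call), #call, __LINE__))           \
            exit(EXIT_FAILURE);                          \
    } while (0)

__global__ void testA(int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = a[i] * 2;
}

int main(void)
{
    const int n = 10;
    int *bi = NULL, *bo = NULL;
    cudaStream_t streamB;

    // Device 1 is the GPU on which the original program crashed.
    CUDA_CHECK(cudaSetDevice(1));
    CUDA_CHECK(cudaStreamCreate(&streamB));
    CUDA_CHECK(cudaMalloc((void**)&bo, n * sizeof(int)));
    // Unchecked, a failure here leaves bi invalid and the loop below
    // segfaults; checked, the actual CUDA error is reported instead.
    CUDA_CHECK(cudaHostAlloc((void**)&bi, n * sizeof(int), cudaHostAllocMapped));
    for (int i = 0; i < n; i++)
        bi[i] = i + 11;

    CUDA_CHECK(cudaMemcpyAsync(bo, bi, n * sizeof(int), cudaMemcpyHostToDevice, streamB));
    testA<<<1, n, 0, streamB>>>(bo);
    CUDA_CHECK(cudaGetLastError());              // errors in the kernel launch itself
    CUDA_CHECK(cudaMemcpyAsync(bi, bo, n * sizeof(int), cudaMemcpyDeviceToHost, streamB));
    CUDA_CHECK(cudaStreamSynchronize(streamB));  // errors raised while the queued work ran

    printf("%d %d %d %d %d\n", bi[0], bi[1], bi[2], bi[3], bi[4]);

    CUDA_CHECK(cudaStreamDestroy(streamB));
    CUDA_CHECK(cudaFree(bo));
    CUDA_CHECK(cudaFreeHost(bi));
    return 0;
}
```

On a healthy second GPU this should print 22 24 26 28 30 (each bi[i] = i + 11 doubled). On a GPU with a pending double-bit ECC error it should instead stop at the first failing call with a message naming the error, which is the information that points at ECC rather than at the code.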

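When a GPU detects a double-bit ECC error during the current session, it can no longer be used for compute work until either:

the GPU is reset (for example by a system reboot, by unloading and reloading the driver, or manually with nvidia-smi),

(or)

ECC is disabled (which usually also requires a system reboot or a GPU reset).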

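You can check the ECC status of your GPUs with the nvidia-smi command. You may already know which GPU reported the ECC error, since you disabled ECC, but if not, based on your initial report it will be the one associated with the cudaSetDevice(1); call, which is probably the Tesla C2070 (i.e. not the K40).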
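To see from inside a program which device ordinal maps to which physical GPU, and whether ECC is currently enabled on it, the runtime exposes the ECCEnabled field of cudaDeviceProp. A minimal sketch follows; the helper name ecc_enabled is hypothetical. Note the assumption that this only reports whether ECC mode is on, not how many errors have occurred: the error counters are visible through nvidia-smi (for example nvidia-smi -q -d ECC), not through this field.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if ECC is enabled on device `dev`,
// 0 if it is disabled, or -1 if the properties could not be queried.
// On success the device properties are copied into *prop.
static int ecc_enabled(int dev, struct cudaDeviceProp *prop)
{
    if (cudaGetDeviceProperties(prop, dev) != cudaSuccess)
        return -1;
    return prop->ECCEnabled ? 1 : 0;
}

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // One line per device ordinal, i.e. the numbers passed to cudaSetDevice().
    for (int dev = 0; dev < count; dev++) {
        struct cudaDeviceProp prop;
        int ecc = ecc_enabled(dev, &prop);
        if (ecc < 0)
            printf("device %d: property query failed\n", dev);
        else
            printf("device %d: %s, ECC %s\n", dev, prop.name,
                   ecc ? "enabled" : "disabled");
    }
    return 0;
}
```

Matching the printed names against the ordinals makes it clear whether cudaSetDevice(1) selects the C2070 or the K40c on that particular machine, since the ordering is chosen by the runtime and is not guaranteed to match the order in which the cards are installed.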
