performance - CUDA: Memory copy to GPU 1 is slower in a multi-GPU setup

Question

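My company has a setup of two GTX 295 cards, so a total of 4 GPUs per server, and we have several servers. GPU 1 specifically was slow compared with GPUs 0, 2 and 3, so I wrote a small speed test to help find the cause of the problem.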

#include <stdio.h>          // printf
#include <cuda_runtime.h>   // CUDA runtime API (events, cudaMalloc, cudaMemcpy)

__global__ void test_kernel(float *d_data) {
    int tid = blockDim.x*blockIdx.x + threadIdx.x;
    for (int i=0;i<10000;++i) {
        d_data[tid] = float(i*2.2);
        d_data[tid] += 3.3;
    }
}

int main(int argc, char* argv[])
{

    int deviceCount;                                                         
    cudaGetDeviceCount(&deviceCount);
    int device = 0; //SELECT GPU HERE
    cudaSetDevice(device);


    cudaEvent_t start, stop;
    unsigned int num_vals = 200000000;
    float *h_data = new float[num_vals];
    for (int i=0;i<num_vals;++i) {
        h_data[i] = float(i);
    }

    float *d_data = NULL;
    float malloc_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    cudaMalloc((void**)&d_data, sizeof(float)*num_vals); // time the allocation only
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );


    float mem_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    float kernel_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    test_kernel<<<1000,256>>>(d_data);
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    printf("cudaMalloc took %f ms\n",malloc_timer);
    printf("Copy to the GPU took %f ms\n",mem_timer);
    printf("Test Kernel took %f ms\n",kernel_timer);

    cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost);

    cudaFree(d_data);   // release the device buffer
    delete[] h_data;
    return 0;
}

The results are

GPU0
cudaMalloc took 0.908640 ms
Copy to the GPU took 296.058777 ms
Test Kernel took 326.721283 ms

GPU1
cudaMalloc took 0.913568 ms
Copy to the GPU took 663.182251 ms
Test Kernel took 326.710785 ms

GPU2
cudaMalloc took 0.925600 ms
Copy to the GPU took 296.915039 ms
Test Kernel took 327.127930 ms

GPU3
cudaMalloc took 0.920416 ms
Copy to the GPU took 296.968384 ms
Test Kernel took 327.038696 ms

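As you can see, the cudaMemcpy to GPU1 takes about twice as long as the copy to any of the other GPUs. This is consistent across all of our servers: it is always GPU1 that is slow. Any ideas why this might be? All servers run Windows XP.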

score 1 · Accepted Answer

This was a driver issue. Updating to the latest driver fixed it.

Answered 2010-05-29T13:08:20.160
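Since the fix turned out to be a driver update, it can help to record which CUDA version the driver and runtime report on each server before and after updating. The following is only a minimal sketch. Note that cudaDriverGetVersion reports the highest CUDA version the installed driver supports, not the display driver's own release number, and the helper names version_major and version_minor are made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 3020 means 3.2).
// Hypothetical helpers, not part of the CUDA API.
static int version_major(int v) { return v / 1000; }
static int version_minor(int v) { return (v % 1000) / 10; }

int main() {
    int driverVersion = 0, runtimeVersion = 0;

    // Highest CUDA version supported by the installed driver.
    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Version of the CUDA runtime library this program is using.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           version_major(driverVersion), version_minor(driverVersion),
           version_major(runtimeVersion), version_minor(runtimeVersion));

    int deviceCount = 0;
    err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // List each GPU with its PCI location so slow devices can be matched to slots.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            return 1;
        }
        printf("GPU%d: %s (PCI bus %d, device %d)\n",
               dev, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

Comparing this output between a server that shows the slow GPU1 and one that has already been updated is a quick way to confirm that the driver really differs.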

score 0 · Accepted Answer

If you can use the faster card's GDDR memory to load the data, you can then do a device-to-device transfer at much, much higher bandwidth, which might also help eliminate the issue. Also, check your bandwidth with NVIDIA's bandwidth test to get some hard numbers to compare against.

Good luck!
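For reference, the question copies 200,000,000 floats (800 MB), so 296 ms corresponds to roughly 2.7 GB/s and 663 ms to roughly 1.2 GB/s. A per-device measurement in the spirit of the bandwidthTest sample from the CUDA SDK could look like the sketch below. It is not NVIDIA's tool: the 64 MiB buffer, the repetition count, the use of pinned host memory and the helper gb_per_s are all choices made for this example.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Convert bytes moved and elapsed milliseconds into GB/s (1 GB = 1e9 bytes).
// Hypothetical helper for this example.
static double gb_per_s(size_t bytes, float ms) {
    return (bytes / 1e9) / (ms / 1e3);
}

// Print the failing call and bail out of main on any CUDA error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per copy (arbitrary choice)
    const int reps = 10;             // copies averaged per device

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));

    for (int dev = 0; dev < deviceCount; ++dev) {
        CHECK(cudaSetDevice(dev));

        void *h = NULL, *d = NULL;
        CHECK(cudaMallocHost(&h, bytes));  // pinned host memory
        CHECK(cudaMalloc(&d, bytes));
        memset(h, 0, bytes);

        cudaEvent_t start, stop;
        CHECK(cudaEventCreate(&start));
        CHECK(cudaEventCreate(&stop));

        // Warm-up copy so one-time setup cost is not measured.
        CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));

        CHECK(cudaEventRecord(start, 0));
        for (int r = 0; r < reps; ++r) {
            CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
        }
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("GPU%d host-to-device: %.2f GB/s\n", dev, gb_per_s(bytes * reps, ms));

        CHECK(cudaEventDestroy(start));
        CHECK(cudaEventDestroy(stop));
        CHECK(cudaFree(d));
        CHECK(cudaFreeHost(h));
    }
    return 0;
}
```

The question's test uses pageable memory from new[], so absolute numbers from this pinned-memory sketch will probably be higher. What matters is whether one device still stands out from the rest.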

score 0 · Accepted Answer

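This could be a problem with your PCI bus. Try swapping the cards into different slots and see whether the problem persists. If it is a bus problem, copy all your data onto the GTX 295 through the faster slot, then use SLI to copy it across to the other GPU (the one on the slow PCI bus).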
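The staged copy suggested here (host to the fast GPU, then GPU to GPU) can be sketched with cudaMemcpyPeer. This assumes CUDA 4.0 or later, which is newer than the toolkit available when the question was asked, and it assumes device 0 is the fast GPU and device 1 the slow one, as in the question. If cudaDeviceCanAccessPeer reports no direct access between the two devices, the runtime may route the copy through host memory, in which case it would not avoid a slow bus path. The helper count_mismatches is made up for the check at the end.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Count positions where two host buffers differ (hypothetical helper).
static size_t count_mismatches(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return a.size() > b.size() ? a.size() : b.size();
    size_t bad = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++bad;
    }
    return bad;
}

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main() {
    const int fastDev = 0, slowDev = 1;  // assumption: numbering from the question
    const size_t n = 16u << 20;          // 16 Mi floats = 64 MiB (arbitrary size)
    const size_t bytes = n * sizeof(float);

    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        printf("Need at least two CUDA devices\n");
        return 0;
    }

    // Report whether the slow GPU can read the fast GPU's memory directly.
    int canAccess = 0;
    CHECK(cudaDeviceCanAccessPeer(&canAccess, slowDev, fastDev));
    printf("Direct peer access GPU%d <- GPU%d: %s\n",
           slowDev, fastDev, canAccess ? "yes" : "no");

    std::vector<float> h(n, 1.0f), back(n, 0.0f);
    float *dFast = NULL, *dSlow = NULL;

    CHECK(cudaSetDevice(fastDev));
    CHECK(cudaMalloc((void**)&dFast, bytes));
    CHECK(cudaSetDevice(slowDev));
    CHECK(cudaMalloc((void**)&dSlow, bytes));

    // Step 1: host -> fast GPU over the fast path.
    CHECK(cudaSetDevice(fastDev));
    CHECK(cudaMemcpy(dFast, h.data(), bytes, cudaMemcpyHostToDevice));

    // Step 2: fast GPU -> slow GPU.
    CHECK(cudaMemcpyPeer(dSlow, slowDev, dFast, fastDev, bytes));

    // Read the data back from the slow GPU to confirm it arrived intact.
    CHECK(cudaSetDevice(slowDev));
    CHECK(cudaMemcpy(back.data(), dSlow, bytes, cudaMemcpyDeviceToHost));
    printf("Mismatched elements: %zu\n", count_mismatches(h, back));

    CHECK(cudaFree(dSlow));
    CHECK(cudaSetDevice(fastDev));
    CHECK(cudaFree(dFast));
    return 0;
}
```

Timing step 2 with events, as in the question's test program, would show whether the detour is actually faster than copying straight to GPU 1.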

score 0 · Accepted Answer

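Are you running in a dual-processor setup? There is a bug in the current Tylersburg chipsets where the bandwidth of the path from x86 (0) to GPU (1) is lower than that of the direct path from x86 (0) to GPU (0). Intel is supposed to release a new revision that fixes this bug. Try using taskset to lock your test process to a specific CPU and see what results you get.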
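taskset is a Linux tool, so the experiment above applies as written only on a Linux host. The same pinning can be done from inside the test program with sched_setaffinity, as in the sketch below, which pins itself to one CPU and then times a single pageable host-to-device copy. The command-line layout (CPU index, then device index), the 256 MiB buffer and the helper pin_to_cpu are assumptions of this example. On Windows, which the question uses, the counterpart would be SetProcessAffinityMask, which is not shown here.

```cuda
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for cpu_set_t, CPU_SET and sched_setaffinity
#endif
#include <sched.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Restrict the calling process to a single logical CPU (Linux only).
// Hypothetical helper for this example.
static bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main(int argc, char* argv[]) {
    const int cpu = (argc > 1) ? atoi(argv[1]) : 0;  // logical CPU to pin to
    const int dev = (argc > 2) ? atoi(argv[2]) : 0;  // CUDA device to copy to

    // Pin before touching CUDA or allocating, so the host buffer is
    // first touched from the chosen CPU.
    if (!pin_to_cpu(cpu)) {
        perror("sched_setaffinity");
        return 1;
    }

    CHECK(cudaSetDevice(dev));

    const size_t bytes = 256u << 20;  // 256 MiB (arbitrary size)
    void* h = malloc(bytes);
    if (h == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    memset(h, 0, bytes);

    void* d = NULL;
    CHECK(cudaMalloc(&d, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("CPU %d -> GPU%d: %.3f ms for %zu bytes\n", cpu, dev, ms, bytes);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d));
    free(h);
    return 0;
}
```

Running it once per CPU socket against each GPU gives a small table. If the copy to GPU 1 is slow only from one socket, that would be consistent with the chipset explanation above.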
