cuda - cudaMemcpy 太慢了

Question

我用cudaMemcpy()一次将 1GB 的数据准确地复制到设备上。这需要 5.9 秒。反之则需要 5.1 秒。这是正常的吗？
函数本身在复制之前有这么多开销吗？理论上，PCIe 总线的吞吐量应该至少为 4GB/s。
没有内存传输重叠，因为 Tesla C870 不支持它。有什么提示吗？

编辑2：我的测试程序+更新的时间；希望阅读量不要太大！
这些cutCreateTimer()函数不会为我编译：'错误：标识符“cutCreateTimer”未定义' - 这可能与机器上安装的旧 cuda 版本（2.0）有关

 __host__ void time_int(int print){
static struct timeval t1; /* var for previous time stamp */
static struct timeval t2; /* var of current time stamp */
double time;
if(gettimeofday(&t2, 0) == -1) return;
if(print != 0){
  time = (double) (t2.tv_sec - t1.tv_sec) + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
  printf(...);
}
t1 = t2;
}

main:
time(0);
void *x;
cudaMallocHost(&x,1073741824);
void *y;
cudaMalloc(&y, 1073741824);
time(1);
cudaMemcpy(y,x,1073741824, cudaMemcpyHostToDevice);
time(1);
cudaMemcpy(x,y,1073741824, cudaMemcpyDeviceToHost);
time(1);

显示的时间是：
0.86 s 分配
0.197 s 第一次复制
5.02 s 第二次复制
奇怪的是：虽然它显示第一次复制 0.197s 如果我看程序运行需要更长的时间。

score 12 · Accepted Answer

是的，这很正常。cudaMemcpy()做了很多检查和工作（如果主机内存是按通常分配的malloc()或mmap()）。它应该检查每一页数据是否在内存中，并将这些页面（一个接一个）移动到驱动程序。

您可以使用cudaHostAlloc函数或cudaMallocHost分配内存而不是malloc. 它将分配始终存储在 RAM 中的固定内存，并且可以由 GPU 的 DMA 直接访问（更快cudaMemcpy()）。引用第一个链接：

分配被页面锁定且可供设备访问的主机内存的 count 字节。驱动程序跟踪使用此函数分配的虚拟内存范围，并自动加速对 cudaMemcpy() 等函数的调用。

唯一的限制因素是系统中固定内存的总量是有限的（不超过RAM大小；最好使用不超过RAM - 1Gb）：

分配过多的固定内存可能会降低系统性能，因为它会减少系统可用于分页的内存量。因此，最好谨慎使用此功能来分配暂存区域以用于主机和设备之间的数据交换。

score 6 · Accepted Answer

Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want a x16 configuration. Some platforms provide two slots, one of which is configured as a x16, the other as a x4. So if you machine has two slots, you might want try moving the card into the other slot. Other systems have two slots, where you get x16 if only one slot is occupied, but you get two slots of x8 if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.

The Tesla C870 is rather old technology, but if I recall correctly transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a 1st generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for tranfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).

Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:

http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf

score 1 · Accepted Answer

除了没有正确配置的系统之外，对于可怕的 PCIe 带宽的最佳解释是 IOH/socket 与 GPU 插入的 PCIe 插槽之间的不匹配。

大多数多插槽 Intel i7 级（Nehalem，Westmere）主板的每个插槽都有一个 I/O 集线器。由于系统内存直接连接到每个 CPU，因此“本地”的 DMA 访问（从连接到与执行 DMA 访问的 GPU 相同的 IOH 的 CPU 获取内存）比非本地访问（从连接的 CPU 获取内存）快得多到另一个 IOH，必须通过链接两个 CPU 的 QPI 互连来满足事务）。

重要提示：不幸的是，SBIOS 配置系统进行交错是很常见的，这会导致连续的内存分配在套接字之间交错。这减轻了 CPU 的本地/非本地访问的性能悬崖（一种思考方式：它使所有内存访问对两个套接字都同样不利），但对 GPU 对数据的访问造成严重破坏，因为它会导致每隔一个页面在 2 -socket 系统是非本地的。

如果系统只有一个 IOH，Nehalem 和 Westmere 类系统似乎不会遇到这个问题。

（顺便说一句，Sandy Bridge 级处理器通过将 PCI Express 支持集成到 CPU 中，在这条道路上又迈出了一步，因此使用 Sandy Bridge，多插槽机器自动拥有多个 IOH。）

您可以通过使用将测试固定到套接字的工具（Linux 上的 numactl，如果可用）运行测试或使用平台相关代码来引导分配和线程在特定套接字上运行来研究这个假设。您无需花哨就可以学到很多东西——只需在 main() 的开头调用一个具有全局效果的函数即可将所有内容强制到一个或另一个套接字上，看看这是否对您的 PCIe 传输性能有很大影响。

cuda - cudaMemcpy 太慢了

3 回答 3

Related

Reference