cuda - Strange performance results: Kepler vs. Fermi

Question

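I use one cluster based on the Tesla M2090 (Fermi) and another based on the K20Xm (Kepler). The kernel I launch runs 2.5 times faster on the Fermi cluster than on the Kepler one. For the Kepler cluster the kernel was compiled with the flags -arch=sm_35 --ptxas-options=-v, and the result is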

ptxas info    : Compiling entry function '_Z22_repack_one_thread_8_2ILb1EEviPtPPh' for 'sm_35'
ptxas info    : Function properties for _Z22_repack_one_thread_8_2ILb1EEviPtPPh
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 344 bytes cmem[0]

So, with 1024 threads, 18 registers per thread, and 0 bytes of shared memory, I have 100% multiprocessor occupancy.
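The occupancy figure can be checked with a little arithmetic. The sketch below is a simplified host-side estimate, not the full occupancy calculator: it ignores register allocation granularity and shared memory. The SmLimits struct and the estimate_occupancy helper are names made up for illustration, and the per-multiprocessor limits for compute capability 3.5 (2048 resident threads, 65536 registers, 16 resident blocks) are assumed from the CUDA programming guide.

```cuda
#include <algorithm>
#include <cstdio>

// Per-multiprocessor resource limits (hypothetical helper struct).
struct SmLimits {
  int max_threads;    // resident threads per multiprocessor
  int max_registers;  // 32-bit registers per multiprocessor
  int max_blocks;     // resident blocks per multiprocessor
};

// Simplified estimate: ignores register allocation granularity and shared memory.
double estimate_occupancy(const SmLimits &sm, int threads_per_block,
                          int regs_per_thread) {
  int by_threads = sm.max_threads / threads_per_block;
  int by_regs = sm.max_registers / (regs_per_thread * threads_per_block);
  int blocks = std::min(std::min(by_threads, by_regs), sm.max_blocks);
  return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}

// Prints the estimate for the launch described in the question.
void print_kepler_occupancy() {
  const SmLimits sm_35 = {2048, 65536, 16};  // assumed limits for sm_35
  printf("sm_35 occupancy: %.0f%%\n",
         100.0 * estimate_occupancy(sm_35, 1024, 18));
}
```

With the compute capability 2.0 limits (1536 threads, 32768 registers, 8 blocks per multiprocessor), the same estimate allows at most one 1024-thread block per multiprocessor, roughly 67%, so the two launches may not be as alike as the 100% figure suggests.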

What could be the reason for the slower performance on the Kepler-based nodes?

Thank you.

Wojciech

Update

My kernel

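// Type aliases inferred from the mangled kernel name above
// (Pt = unsigned short*, PPh = unsigned char**).
typedef unsigned short word;
typedef unsigned char byte;

template <bool nocheck>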
__global__ void _repack_one_thread_8_2 (int size, word *input, byte **outputs)
{
  int idx = blockDim.x * blockIdx.x + threadIdx.x;

  if (nocheck || idx * 8 < size)
  {
    word *ptr = input + idx * 4;
    byte bytes[8] = {0,0,0,0,0,0,0,0};
    int i, j;

    for (i = 0; i < 4; i++, ptr++)
    {
      word b = *ptr;

      for (j = 0; j < 8; j++)
        bytes[j] |= (((b >> (j * 2)) & 3) << (i * 2));
    }

    for (i = 0; i < 8; i++)
      outputs[i][idx] = bytes[i];
  }
}

Compile command for Kepler

nvcc  -arch=sm_35 --ptxas-options=-v  -c -O3 -I.. -o

Compile command for Fermi

nvcc  -arch=sm_20 --ptxas-options=-v  -c -O3 -I.. -o

score 1 · Accepted Answer

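There are many possible reasons. Simply saying "my code runs faster on Fermi", without details about your code, doesn't leave much to go on.

Your kernel may not even be running in the Fermi case (code compiled with -arch=sm_35 will not run on a Fermi GPU). That would certainly make the Fermi case look better.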
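One way to test this hypothesis is to print each device's compute capability and compare it with the architecture the binary was built for. The sketch below assumes the usual compatibility rules: machine code (SASS) runs on devices of the same major version with an equal or higher minor version, and embedded PTX can be JIT-compiled for any equal or higher compute capability. The helper names can_run_on and report_devices are chosen here for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: can code built for bin_major.bin_minor run on a device
// of compute capability dev_major.dev_minor? has_ptx says whether the binary
// also embeds PTX for that architecture.
bool can_run_on(int dev_major, int dev_minor, int bin_major, int bin_minor,
                bool has_ptx) {
  bool sass_ok = dev_major == bin_major && dev_minor >= bin_minor;
  bool ptx_ok = dev_major > bin_major ||
                (dev_major == bin_major && dev_minor >= bin_minor);
  return sass_ok || (has_ptx && ptx_ok);
}

// Prints every visible device and whether an sm_35 build could run on it.
// Returns the number of devices found (0 if the query fails).
int report_devices() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
    bool ok = can_run_on(prop.major, prop.minor, 3, 5, true);
    printf("%s: compute capability %d.%d, sm_35 build %s\n", prop.name,
           prop.major, prop.minor, ok ? "can run" : "cannot run");
  }
  return count;
}
```

A Tesla M2090 reports compute capability 2.0, so under these rules neither the sm_35 machine code nor its PTX can execute there.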

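There are many other possibilities as well. The Kepler Tuning Guide covers a number of areas that may be worth investigating.

You should also do proper cuda error checking in your code, and try running your code with cuda-memcheck in both cases, to get more information about any kernel execution problems.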
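A minimal sketch of what such error checking might look like follows. The CUDA_CHECK macro, fill_kernel, and launch_checked are illustrative names, not part of the original code, and the stand-in kernel only marks where the real launch would go.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macro: report a failed runtime call and return its error code.
#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t err_ = (call);                                  \
    if (err_ != cudaSuccess) {                                  \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
              cudaGetErrorString(err_));                        \
      return err_;                                              \
    }                                                           \
  } while (0)

// Stand-in kernel; the real code would launch _repack_one_thread_8_2 here.
__global__ void fill_kernel(int *out) {
  out[blockDim.x * blockIdx.x + threadIdx.x] = 1;
}

// Sketch only: d_out is not freed on the early-return error paths.
cudaError_t launch_checked() {
  int *d_out = NULL;
  CUDA_CHECK(cudaMalloc((void **)&d_out, 1024 * sizeof(int)));
  fill_kernel<<<1, 1024>>>(d_out);
  CUDA_CHECK(cudaGetLastError());       // launch failures, e.g. no code for this GPU
  CUDA_CHECK(cudaDeviceSynchronize());  // failures during kernel execution
  CUDA_CHECK(cudaFree(d_out));
  return cudaSuccess;
}
```

Without the cudaGetLastError call, a kernel that never launched returns immediately and can look very fast in a timing comparison. Running the program as cuda-memcheck ./app (where ./app stands for your executable) additionally reports invalid memory accesses inside the kernel.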

score -1 · Accepted Answer

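I don't remember the detailed differences between Fermi and Kepler. Kepler may have fewer compute units than Fermi, but it can keep more warps in flight. I'll check after work.

Also, 1024 threads seems too few for a comparison.

So could you check whether 1024*1024 threads works?

Then I did some searching. It seems the K20Xm has fewer SMX units (14 vs. 16 SMs for the M2090), lower MAD throughput (384 vs. 1332.2 GFLOPs), and a lower clock frequency (732 vs. 1301 MHz). By the way, can one SMX be treated as two SMs?

It seems strange..

Data from: wiki, "integer is slower than float", and some whitepapers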
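The throughput figures can be sanity-checked with the usual peak estimate of 2 FLOPs (one multiply-add) per core per clock. The sketch below assumes 32 CUDA cores per Fermi SM and 192 per Kepler SMX, as described in NVIDIA's architecture whitepapers, and reuses the SM counts and clocks quoted above. The helper name peak_sp_gflops is illustrative.

```cuda
#include <cstdio>

// Peak single-precision GFLOPS = 2 FLOPs (multiply-add) * cores * clock in GHz.
double peak_sp_gflops(int sm_count, int cores_per_sm, double clock_mhz) {
  return 2.0 * sm_count * cores_per_sm * clock_mhz / 1000.0;
}

// Prints the estimate for both cards using the figures quoted above.
void print_peaks() {
  printf("M2090 : %.1f GFLOPS\n", peak_sp_gflops(16, 32, 1301.0));  // 16 SMs
  printf("K20Xm : %.1f GFLOPS\n", peak_sp_gflops(14, 192, 732.0));  // 14 SMX units
}
```

Under these assumptions the M2090 comes out at about 1332 GFLOPS, matching the figure above, while the K20Xm comes out at about 3935 GFLOPS rather than 384, so peak floating-point throughput alone does not favor the M2090. An SMX is also considerably wider than two Fermi SMs (192 cores versus 2 x 32). The kernel in question uses integer shifts and byte stores, so these floating-point peaks are only a rough guide.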
