optimization - What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

Question

In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:

Its index as a lane within its warp (its "lane id")
The index of the warp of which it is a lane within the block (its "warp id")

Assuming the grid is 1-dimensional (a.k.a. linear, i.e. blockDim.y and blockDim.z are 1), one can obviously obtain these as follows:

enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;

and if you don't trust the compiler to optimize that, you could rewrite it as:

enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;

Is that the most efficient thing to do? It still seems like a lot of waste for every thread to have to compute this.

(Inspired by this question.)

score 16 · Accepted Answer

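The naive computation is currently the most efficient.

Note: This answer has been heavily edited.

It is very tempting to try to avoid the computation altogether, since these two values seem to be available already if you look under the hood.

You see, NVIDIA GPUs have special registers which your (compiled) code can read to access various kinds of useful information. One such register holds threadIdx.x; another holds blockDim.x; another holds the clock tick count; and so on. C++ as a language obviously does not expose these; in fact, neither does CUDA. However, the intermediate representation into which CUDA code is compiled, named PTX, does expose these special registers (since PTX 1.3, i.e. with CUDA versions >= 2.1).

Two of these special registers are %warpid and %laneid. Now, CUDA supports inlining PTX code within CUDA code using the asm keyword, just as it can be used in host-side code to emit CPU assembly instructions directly. With this mechanism, one can use these special registers: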

__forceinline__ __device__ unsigned lane_id()
{
    unsigned ret; 
    asm volatile ("mov.u32 %0, %laneid;" : "=r"(ret));
    return ret;
}

__forceinline__ __device__ unsigned warp_id()
{
    // this is not equal to threadIdx.x / 32
    unsigned ret; 
    asm volatile ("mov.u32 %0, %warpid;" : "=r"(ret));
    return ret;
}

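...but there are two problems here.

The first problem, as @Patwie suggests, is that %warpid does not give you what you actually want: it is not the index of the warp in the context of the grid, but rather in the context of the physical SM (which can hold only so many resident warps at a time), and those two are not the same. So don't use %warpid.

As for %laneid, it does give you the correct value, but it will almost certainly hurt your performance. Even though it is a "register", it is not like the regular registers in your register file, with 1-cycle access latency. It is a special register, which in the actual hardware is retrieved using an S2R instruction, and that instruction can exhibit long latency. Since you almost certainly already have the value of threadIdx.x in a register, applying a bitmask to that value is faster than retrieving %laneid.

Bottom line: just compute the warp ID and lane ID from the thread ID. We can't get around this, for now.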
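As a minimal sketch of that bottom line, the two values can be wrapped in small helpers that take the thread index as a parameter. The names lane_id_of, warp_id_of and the HOST_DEVICE macro are hypothetical, chosen for illustration; the macro only exists so the same snippet compiles both with nvcc and with a plain host compiler. It assumes a linear block and a warp size of 32.

```cpp
// HOST_DEVICE is a hypothetical helper macro: it expands to the CUDA
// qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Assumption: warp size is 32, as in the question.
constexpr unsigned warp_size = 32;
constexpr unsigned log_warp_size = 5;  // log2(warp_size)

// Lane index within the warp: the low log_warp_size bits of the thread index.
HOST_DEVICE constexpr unsigned lane_id_of(unsigned thread_idx_x) {
    return thread_idx_x & (warp_size - 1);
}

// Warp index within the block: the remaining high bits of the thread index.
HOST_DEVICE constexpr unsigned warp_id_of(unsigned thread_idx_x) {
    return thread_idx_x >> log_warp_size;
}

// Compile-time sanity checks: 70 = 2 * 32 + 6.
static_assert(lane_id_of(70) == 6, "thread 70 is lane 6");
static_assert(warp_id_of(70) == 2, "thread 70 is in warp 2");
```

Inside a kernel, one would call these as lane_id_of(threadIdx.x) and warp_id_of(threadIdx.x); since threadIdx.x is normally already in a register, each call should reduce to a single bitwise instruction.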

score 5 · Accepted Answer

The other answer is dangerous! Compute the lane-id and warp-id yourself.

#include <cuda.h>
#include <cstdio>  // printf

inline __device__ unsigned get_lane_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
  return ret;
}

inline __device__ unsigned get_warp_id() {
  unsigned ret;
  asm volatile("mov.u32 %0, %warpid;" : "=r"(ret));
  return ret;
}

__global__ void kernel() {
  const int actual_warpid = get_warp_id();
  const int actual_laneid = get_lane_id();
  const int expected_warpid = threadIdx.x / 32;
  const int expected_laneid = threadIdx.x % 32;
  if (expected_laneid == 0) {
    printf("[warp:] actual: %i  expected: %i\n", actual_warpid,
           expected_warpid);
    printf("[lane:] actual: %i  expected: %i\n", actual_laneid,
           expected_laneid);
  }
}

int main(int argc, char const *argv[]) {
  dim3 grid(8, 7, 1);
  dim3 block(4 * 32, 1);

  kernel<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}

This prints something like:

[warp:] actual: 4  expected: 3
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
[warp:] actual: 12  expected: 1
[warp:] actual: 4  expected: 3
[warp:] actual: 0  expected: 0
[warp:] actual: 13  expected: 2
[warp:] actual: 12  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 6  expected: 1
[warp:] actual: 13  expected: 2
[warp:] actual: 10  expected: 0
[warp:] actual: 1  expected: 1
...
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0
[lane:] actual: 0  expected: 0

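See also the PTX documentation:

A predefined, read-only special register that returns the thread's warp identifier. The warp identifier provides a unique warp number within a CTA but not across CTAs within a grid. The warp identifier will be the same for all threads within a single warp.

Note that %warpid is volatile and returns the location of a thread at the moment when read, but its value may change during execution, e.g., due to rescheduling of threads following preemption.

Hence, it is the scheduler's warp-id, with no guarantee that it matches the virtual warp-id (counting from 0).

The documentation makes this clear:

For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is intended mainly to enable profiling and diagnostic code to sample and log information such as work place mapping and load distribution.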
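What the documentation recommends can be sketched roughly as follows, assuming a 1-D grid of 1-D blocks and a warp size of 32. The function names and the HOST_DEVICE macro are hypothetical and not part of CUDA or PTX; the parameters stand in for blockIdx.x, blockDim.x and threadIdx.x (the CUDA-level views of %ctaid, %ntid and %tid).

```cpp
// HOST_DEVICE is a hypothetical helper macro so the snippet also compiles
// with a plain host compiler; under nvcc it expands to the CUDA qualifiers.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Assumption: warp size is 32.
constexpr unsigned warp_size = 32;

// Number of warps in a block; rounds up so that a partially filled
// last warp is still counted.
HOST_DEVICE constexpr unsigned warps_per_block(unsigned block_dim_x) {
    return (block_dim_x + warp_size - 1) / warp_size;
}

// Virtual warp index that is unique across the whole (1-D) grid.
HOST_DEVICE constexpr unsigned global_warp_id(unsigned block_idx_x,
                                              unsigned block_dim_x,
                                              unsigned thread_idx_x) {
    return block_idx_x * warps_per_block(block_dim_x)
         + thread_idx_x / warp_size;
}

// Block 2 of a grid with 128-thread blocks, thread 70: 2 * 4 + 2.
static_assert(global_warp_id(2, 128, 70) == 10, "2 * 4 + 2");
```

In a kernel this would be called as global_warp_id(blockIdx.x, blockDim.x, threadIdx.x). Unlike %warpid, the result depends only on the launch configuration, so it stays stable for the lifetime of the thread.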

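If you think "OK, let's use CUB for this": this even affects cub::WarpId():

Returns the warp ID of the calling thread. Warp ID is guaranteed to be unique among warps, but may not correspond to a zero-based ranking within the thread block.

EDIT: Using %laneid seems to be safe.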
