c++ - How to not repeat myself without macros when writing similar CUDA kernels?

Question

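I have several CUDA kernels that basically do the same thing, with some variations. What I would like to do is reduce the amount of code needed. My first idea was to use macros, so my resulting kernels look like this (simplified):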

__global__ void kernelA( ... )
{
   INIT(); // macro to initialize variables

   // do specific stuff for kernelA
   b = a + c;

   END(); // macro to write back the result
}

__global__ void kernelB( ... )
{
   INIT(); // macro to initialize variables

   // do specific stuff for kernelB
   b = a - c;

   END(); // macro to write back the result
}
...

Since macros are nasty, ugly, and evil, I am looking for a better and cleaner way. Any suggestions?

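(A switch statement would not do the job: in reality, the parts that are the same and the parts that are kernel-specific are heavily interwoven. Several switch statements would be needed, which would make the code pretty unreadable. Furthermore, function calls would not initialize the needed variables.)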

(This question may apply to general C++ as well; just replace every 'CUDA kernel' with 'function' and remove '__global__'.)

score 5 · Accepted Answer

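Update: I was told in the comments that classes and inheritance do not mix well with CUDA. Therefore only the first part of this answer applies to CUDA, while the other parts answer the more general C++ part of your question.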

For CUDA, you will have to use plain functions, "C style":

struct KernelVars {
  int a;
  int b;
  int c;
};

__device__ void init(KernelVars& vars) {
  INIT(); //whatever the actual code is
}

__device__ void end(KernelVars& vars) {
  END(); //whatever the actual code is
}

__global__ void KernelA(...) {
  KernelVars vars;
  init(vars);
  vars.b = vars.a + vars.c;
  end(vars);
}
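
The same structure can be tried out on the host as plain C++. This is a minimal sketch, assuming that INIT() loads the operands and END() writes the result back; the actual macro contents are not shown in the question, so the bodies of `init` and `end` and their extra parameters (`in_a`, `in_c`, `out`, `i`) are hypothetical placeholders.

```cpp
// Host-side analogue of the device-function approach (no CUDA required).
// Assumption: the real INIT() reads inputs and END() stores the result;
// the bodies below are placeholders for illustration only.
struct KernelVars {
    int a;
    int b;
    int c;
};

// Stand-in for INIT(): fetch the operands for element i.
inline void init(KernelVars& vars, const int* in_a, const int* in_c, int i) {
    vars.a = in_a[i];
    vars.c = in_c[i];
    vars.b = 0;
}

// Stand-in for END(): write the result for element i.
inline void end(const KernelVars& vars, int* out, int i) {
    out[i] = vars.b;
}

void kernelA(const int* in_a, const int* in_c, int* out, int i) {
    KernelVars vars;
    init(vars, in_a, in_c, i);
    vars.b = vars.a + vars.c;  // kernel-specific part
    end(vars, out, i);
}

void kernelB(const int* in_a, const int* in_c, int* out, int i) {
    KernelVars vars;
    init(vars, in_a, in_c, i);
    vars.b = vars.a - vars.c;  // kernel-specific part
    end(vars, out, i);
}
```

Because all shared state lives in `KernelVars`, the kernel-specific line has to go through `vars` explicitly, which is the price of replacing the macros with ordinary functions.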

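Here is the answer for general C++, where you can use OOP techniques such as constructors and destructors (which are a perfect fit for such init/end pairs), or the template method pattern, which can be used in other languages as well: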

Using ctor/dtor and templates, "C++ style":

class KernelBase {
protected:
  int a, b, c;

public:
  KernelBase() {
    INIT(); //replace by the contents of that macro
  }   
  ~KernelBase() {
    END();  //replace by the contents of that macro
  }
  virtual void run() = 0;
};

struct KernelAdd : KernelBase {
  void run() { b = a + c; }
};

struct KernelSub : KernelBase {
  void run() { b = a - c; }
};

template<class K>
void kernel(...)
{
  K k;
  k.run();
}

void kernelA( ... ) { kernel<KernelAdd>(); }
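
To make the ctor/dtor idea concrete, here is a small host-only sketch. It assumes, hypothetically, that INIT() amounts to storing the operands and END() amounts to writing `b` to an output location; the constructor arguments and the `out_` member are inventions of this sketch. It also drops the `virtual` on `run()`, since the function template already selects the concrete type at compile time; that is a choice made here, not part of the code above.

```cpp
// Constructor plays the role of INIT(), destructor the role of END().
// Assumption: END() writes b to an output location (hypothetical).
class KernelBase {
protected:
    int a, b, c;
    int* out_;

public:
    KernelBase(int a_in, int c_in, int* out)
        : a(a_in), b(0), c(c_in), out_(out) {}

    // Runs automatically when the kernel object goes out of scope.
    ~KernelBase() { *out_ = b; }
};

struct KernelAdd : KernelBase {
    using KernelBase::KernelBase;  // inherit the constructor (C++11)
    void run() { b = a + c; }
};

struct KernelSub : KernelBase {
    using KernelBase::KernelBase;
    void run() { b = a - c; }
};

template <class K>
void kernel(int a, int c, int* out) {
    K k(a, c, out);  // "INIT"
    k.run();         // kernel-specific part
}                    // "END": destructor writes the result here
```

Calling `kernel<KernelAdd>(5, 3, &result)` leaves the sum in `result` once the call returns, because the write-back happens in the destructor.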

Using the template method pattern, generic "OOP style":

class KernelBase {
  virtual void do_run() = 0;
protected:
  int a, b, c;
public:
  void run() { //the template method
    INIT(); 

    do_run();

    END();
  }
};

struct KernelAdd : KernelBase {
  void do_run() { b = a + c; }
};

struct KernelSub : KernelBase {
  void do_run() { b = a - c; }
};

void kernelA(...)
{
  KernelAdd k;
  k.run();
}
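
A runnable host-only version of the template method pattern might look like the sketch below. The parameters of `run()` and its return value are hypothetical stand-ins for whatever INIT() and END() really do; only the shape (fixed skeleton in the base class, varying step in `do_run()`) is taken from the code above.

```cpp
class KernelBase {
    // The varying step; private so callers must go through run().
    virtual void do_run() = 0;

protected:
    int a = 0, b = 0, c = 0;

public:
    virtual ~KernelBase() = default;

    // The template method: fixed skeleton around the varying step.
    int run(int a_in, int c_in) {
        a = a_in;  // stand-in for INIT() (assumed: load operands)
        c = c_in;

        do_run();

        return b;  // stand-in for END() (assumed: hand back the result)
    }
};

struct KernelAdd : KernelBase {
    void do_run() override { b = a + c; }
};

struct KernelSub : KernelBase {
    void do_run() override { b = a - c; }
};
```

With these definitions, `KernelAdd().run(5, 3)` evaluates the shared setup, the addition, and the shared teardown in that order; the call to `do_run()` is a virtual dispatch, unlike the template-based variant.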

score 1 · Accepted Answer

You can use device functions as a replacement for "INIT()" and "END()".

__device__ int init()
{
    return threadIdx.x + blockIdx.x * blockDim.x;
}

Another option is to use function templates:

#define ADD 1
#define SUB 2

template <int op> __global__ void calculate(float* a, float* b, float* c)
{
    // init code ... (must define id)
    switch (op)
    {
    case ADD:
        c[id] = a[id] + b[id];
        break;
    case SUB:
        c[id] = a[id] - b[id];
        break;
    }
    // end code ...
}

And invoke them with:

calculate<ADD><<<...>>>(a, b, c);

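The CUDA compiler does the work: it builds a separate version of the function for each template argument and removes the dead code sections to optimize performance.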
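
The same compile-time selection can be sketched in plain C++ without the macros. This is an illustration only: the `Op` enum, the element count `n`, and the loop that stands in for the per-thread index are assumptions of this sketch, not part of the kernel above.

```cpp
// Operation selected at compile time via a non-type template parameter.
enum class Op { Add, Sub };

template <Op op>
void calculate(const float* a, const float* b, float* c, int n) {
    // On the host, a loop stands in for the per-thread index `id`.
    for (int id = 0; id < n; ++id) {
        // `op` is a compile-time constant, so each instantiation
        // only ever takes one of these branches.
        switch (op) {
        case Op::Add:
            c[id] = a[id] + b[id];
            break;
        case Op::Sub:
            c[id] = a[id] - b[id];
            break;
        }
    }
}
```

A call such as `calculate<Op::Add>(a, b, c, n)` instantiates a version specialized for addition, and `calculate<Op::Sub>(a, b, c, n)` a separate one for subtraction. Using a scoped enum instead of `#define` constants keeps the selector type-checked.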
