
I need advice on optimizing my kernel and device code. I understand that the CUDA documentation (and many presentations) recommend using large thread block sizes to hide memory and arithmetic latency.

My kernels and device functions are compute-intensive. I therefore try to use as many registers as possible and (obviously) compromise on occupancy as a result. The point is that for my application, instruction-level parallelism matters more than large thread blocks.

But the basic idea behind ILP is to have independent instructions. My questions are:

1) How can this be achieved? In a computation there are always variables that are reused in other calculations.
2) Can anyone suggest or provide examples of how dependent instructions can be converted into independent ones?
3) I have also read (somewhere) that for arithmetic computations a maximum ILP of 4 can be achieved, i.e. a single thread computes 4 independent instructions. Does this mean that if there are four such instructions followed by dependent instructions, the warp will stall until the dependencies are satisfied?
4) Can anyone recommend reading material and code that exploit ILP?

I am also providing some profiled code here; it may not make much sense on its own. The code implements the following equation:

f(x) = S\,e^{-N S x}\,\Bigl(\prod_{i=1}^{n} I_i\Bigr)\,\sum_{i=1}^{n}\frac{1 + N\,(I_i + S x)}{(I_i + S x)^2\,\prod_{j \ne i}(I_j - I_i)}

The point is that I want to reach maximum performance, and I want to use ILP for that. There are other device functions in my code as well, so I am using:

Thread block size: 192 threads
14 SMs (32 cores each): 448 cores
8 concurrent blocks per SM: 8 x 192 = 1536 threads
Compiling with "--ptxas-options=-v" reports 50 registers per thread (roughly 33% occupancy)

All parameters used in the equation are of type double (except n), with e.g. n = 2. The params array contains S at params[0], I1 at params[1] and I2 at params[2].

#define N 3.175e-3
__device__ double gpu_f_different_mean(double x, double params[], int n) {

   // params[0] holds S, params[1..n] hold I_1 .. I_n
   double S = params[0];

   // product of all I_i
   double product_I = 1.0;
   for (int i = 1; i <= n; i++) {
      product_I = product_I * params[i];
   }

   // common prefactor: S * exp(-N*S*x) * prod(I_i)
   double tmp   = S * exp(-N * S * x);
   double outer = product_I * tmp;

   // sum_{i=1..n} (1 + N*(I_i + S*x)) / ((I_i + S*x)^2 * prod_{j != i}(I_j - I_i))
   double result = 0.0;

   for (int i = 1; i <= n; i++) {

      double reduction = (params[i] + S * x);
      double numerator = 1 + N * reduction;

      double denom_prod = 1.0;
      for (int j = 1; j <= n; j++) {
         if ( i != j)
            denom_prod = denom_prod * (params[j] - params[i]);
      }

      double denominator = pow(reduction, 2) * denom_prod;
      result             = result + (numerator / denominator);
   }

   return outer * result;
}
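
For context, a driver kernel using the 192-thread blocks mentioned above could look roughly like the sketch below; the kernel name eval_f, the buffers and the count m are placeholders for illustration, only gpu_f_different_mean above is my actual code.

__global__ void eval_f(const double *x, double *out, double *params, int n, int m) {
   // one x value per thread; params = {S, I_1, ..., I_n} as described above
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < m)
      out[idx] = gpu_f_different_mean(x[idx], params, n);
}

// launched e.g. as: eval_f<<<(m + 191) / 192, 192>>>(d_x, d_out, d_params, 2, m);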

Hardware

I am using a Fermi-architecture GPU, a GTX 470 with compute capability 2.0.


1 Answer


Several comments:

a) Dependency chains like the one caused by continuous updating of denom_prod can be broken by introducing multiple reduction variables:

  double denom_prod1 = 1.0;
  double denom_prod2 = 1.0;
  int j;
  for (j = 1; j <= n-1; j += 2) {
     if ( i != j)
        denom_prod1 *= (params[j  ] - params[i]);
     if ( i != j+1)
        denom_prod2 *= (params[j+1] - params[i]);
  }
  if (j <= n) {                 // handle the leftover element when n is odd
     if ( i != j)
        denom_prod1 *= (params[j  ] - params[i]);
  }
  double denom_prod = denom_prod1 * denom_prod2;
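
The same idea can be applied to the serial updates of result in your function. A minimal sketch (the function name is made up; it assumes n is even, as in your n = 2 case, and relies on the N macro from your code):

  __device__ double sum_terms_two_accumulators(const double params[], int n,
                                                double S, double x) {
     // two partial sums, so consecutive additions are independent of each other
     double result1 = 0.0;
     double result2 = 0.0;

     for (int i = 1; i <= n; i += 2) {          // assumes n is even
        // term for index i
        double red1 = params[i] + S * x;
        double dp1  = 1.0;
        for (int j = 1; j <= n; j++)
           if (j != i) dp1 *= (params[j] - params[i]);

        // term for index i+1, independent of the computation above
        double red2 = params[i + 1] + S * x;
        double dp2  = 1.0;
        for (int j = 1; j <= n; j++)
           if (j != i + 1) dp2 *= (params[j] - params[i + 1]);

        result1 += (1.0 + N * red1) / (red1 * red1 * dp1);
        result2 += (1.0 + N * red2) / (red2 * red2 * dp2);
     }

     return result1 + result2;
  }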

b) The conditional inside the loop can be eliminated by breaking the loop into two parts:

  double denom_prod = 1.0;
  for (int j = 1; j < i; j++)
     denom_prod = denom_prod * (params[j] - params[i]);
  for (int j = i+1; j <= n; j++)
     denom_prod = denom_prod * (params[j] - params[i]);

c) You can exploit the fact that exchanging i and j merely flips the sign of (params[j] - params[i]) by computing the factors for (i, j) and (j, i) in one go.
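
A minimal sketch of that idea (the helper name all_denom_prods is made up): each difference params[j] - params[i] is computed only once and reused, with its sign flipped, for the (j, i) case; the two multiplications in the inner loop are also independent of each other:

  __device__ void all_denom_prods(const double params[], int n, double denom_prod[]) {
     for (int i = 1; i <= n; i++)
        denom_prod[i] = 1.0;

     for (int i = 1; i <= n; i++) {
        for (int j = i + 1; j <= n; j++) {
           double diff = params[j] - params[i];  // computed once per (i, j) pair
           denom_prod[i] *= diff;                // factor for index i
           denom_prod[j] *= -diff;               // same factor, opposite sign, for index j
        }
     }
  }

  // denom_prod[i] then holds prod_{j != i} (params[j] - params[i]) for each i.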

d) reduction * reduction is faster (and potentially more accurate) than pow(reduction, 2)
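
Applied to your code this is a one-line change:

  double denominator = reduction * reduction * denom_prod;  // instead of pow(reduction, 2) * denom_prod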


Regarding your questions:

1) and 2) see my comment a).

3) This likely refers to the fact that Fermi-generation GPUs have two independent warp schedulers per SM. On compute capability 2.1 devices each scheduler can additionally dual-issue two independent instructions from the same warp, for up to four instructions issued per SM and cycle; your compute capability 2.0 GTX 470 issues at most one instruction per scheduler and cycle.

However, the problem of dependent instructions goes further than that: a dependent instruction suffers a latency of roughly 16 to 24 cycles, i.e. the second of two dependent instructions has to wait that many cycles before it can be issued. The cycles in between can be used either by independent instructions from the same warp (which have to be located between the dependent instructions, as current Nvidia GPUs cannot issue instructions out of order), or by instructions from other warps, which are always independent. So for optimal performance you want either many warps, or consecutive independent instructions, or ideally both.
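
A purely illustrative sketch of the difference (the function names and operands are made up):

  // 'a * z' depends on 'a', so it cannot issue until the first multiply
  // has completed (roughly 16 to 24 cycles later) unless other work fills the gap.
  __device__ double chain_example(double x, double y, double z) {
     double a = x * y;
     return a * z;
  }

  // Here the two multiplies are independent and can be in flight at the same
  // time; the only dependent instruction comes last.
  __device__ double ilp_example(double x, double y, double u, double v) {
     double c = x * y;
     double d = u * v;
     return c + d;
  }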

4) The publications of Vasily Volkov make for excellent reading on this subject, particularly his "Better Performance at Lower Occupancy" presentation.

Answered 2013-05-24T19:06:23.183