c++ - 为什么这个函子在优化后比普通代码执行得更好？

Question

继续我对优化有限差分代码的探索……通过使用多行宏，我设法制作了一种通用算法，用于对相邻单元格的差异求和。以前我使用过函子，但性能很差。

// T0, T: double[T_size_x*T_size_y*T_size_z]
template <bool periodic_z, bool periodic_y, bool periodic_x>
void process_T(double *T0, double *T, const int &T_size_x, const int &T_size_y, const int &T_size_z) {
  double sum, base;
  const int dy = (T_size_y-1)*T_size_x;
  const int dz = (T_size_z-1)*T_size_x*T_size_y;
  int pos = 0; //T_size_x * j;
  struct _Diff {
    inline void operator() (const int &pos) { sum += T0[pos] - base; }
    inline void edge(const int &pos) { sum += 0.5*(T0[pos] -base); }
    _Diff (double &s, double &b, double *t0) : sum(s), base(b), T0(t0) {}
  private:
    double &sum;
    double &base;
    double *T0;
  } Diff (sum, base, T0);
  #define BigOne(i, periodic, T_size, left, right, start, end) \
  if (i > 0) Diff(left); \
  if (i < T_size-1) Diff(right); \
  if (!periodic) { \
    if (i == T_size-1 || i == 1) Diff.edge(left); \
    if (i == T_size-2 || i == 0) Diff.edge(right); \
  } else { \
    if (i == 0) Diff(end); \
    if (i == T_size-1) Diff(start); \
  }   
  for (int k = 0; k < T_size_z; ++k) {
    for (int j = 0; j < T_size_y; ++j) {
      for (int i = 0; i < T_size_x; ++i, ++pos) {
    sum = 0;
    base = T0[pos];
    // process x direction
    BigOne(i, periodic_x, T_size_x, pos-1, pos+1, pos - i, pos + T_size_x-1)
    // process y direction
    BigOne(j, periodic_y, T_size_y, pos-T_size_x, pos+T_size_x, pos - dy, pos + dy)
    // process z direction
    BigOne(k, periodic_z, T_size_z, pos-T_size_x*T_size_y, pos+T_size_x*T_size_y, pos - dz, pos + dz)
    T[pos] = T0[pos] + sum * 0.08; // where 0.08 is some magic number
      }
    }
  }
}

为了使代码更高效，我也考虑将仿函数转换为宏。

#define Diff(pos) sum += T0[pos] - base
#define Diff_edge(pos) sum += 0.5*(T0[pos]-base)
// Note: change Diff.edge in BigOne to Diff_edge as well

当使用g++(4.8.2) 编译而不进行优化时，正如预期的那样，宏代码运行得更快。但是当我用-O2or编译它时-O3，突然上面的第一个例子产生了更好的结果（在 15000 次迭代中，仿函数在 22.7 秒内完成，宏在 23.7 秒内完成）。

为什么会这样？仿函数是否以某种方式提示编译器缓存指令？

c++ - 为什么这个函子在优化后比普通代码执行得更好？

0 回答 0

Related

Reference