继续我对优化有限差分代码的探索……通过使用多行宏,我设法制作了一种通用算法,用于对相邻单元格的差异求和。以前我使用过函子,但性能很差。
// T0, T: double[T_size_x*T_size_y*T_size_z]
template <bool periodic_z, bool periodic_y, bool periodic_x>
void process_T(double *T0, double *T, const int &T_size_x, const int &T_size_y, const int &T_size_z) {
double sum, base;
const int dy = (T_size_y-1)*T_size_x;
const int dz = (T_size_z-1)*T_size_x*T_size_y;
int pos = 0; //T_size_x * j;
struct _Diff {
inline void operator() (const int &pos) { sum += T0[pos] - base; }
inline void edge(const int &pos) { sum += 0.5*(T0[pos] -base); }
_Diff (double &s, double &b, double *t0) : sum(s), base(b), T0(t0) {}
private:
double ∑
double &base;
double *T0;
} Diff (sum, base, T0);
#define BigOne(i, periodic, T_size, left, right, start, end) \
if (i > 0) Diff(left); \
if (i < T_size-1) Diff(right); \
if (!periodic) { \
if (i == T_size-1 || i == 1) Diff.edge(left); \
if (i == T_size-2 || i == 0) Diff.edge(right); \
} else { \
if (i == 0) Diff(end); \
if (i == T_size-1) Diff(start); \
}
for (int k = 0; k < T_size_z; ++k) {
for (int j = 0; j < T_size_y; ++j) {
for (int i = 0; i < T_size_x; ++i, ++pos) {
sum = 0;
base = T0[pos];
// process x direction
BigOne(i, periodic_x, T_size_x, pos-1, pos+1, pos - i, pos + T_size_x-1)
// process y direction
BigOne(j, periodic_y, T_size_y, pos-T_size_x, pos+T_size_x, pos - dy, pos + dy)
// process z direction
BigOne(k, periodic_z, T_size_z, pos-T_size_x*T_size_y, pos+T_size_x*T_size_y, pos - dz, pos + dz)
T[pos] = T0[pos] + sum * 0.08; // where 0.08 is some magic number
}
}
}
}
为了使代码更高效,我也考虑将仿函数转换为宏。
#define Diff(pos) sum += T0[pos] - base
#define Diff_edge(pos) sum += 0.5*(T0[pos]-base)
// Note: change Diff.edge in BigOne to Diff_edge as well
当使用g++
(4.8.2) 编译而不进行优化时,正如预期的那样,宏代码运行得更快。但是当我用-O2
or编译它时-O3
,突然上面的第一个例子产生了更好的结果(在 15000 次迭代中,仿函数在 22.7 秒内完成,宏在 23.7 秒内完成)。
为什么会这样?仿函数是否以某种方式提示编译器缓存指令?