我有类似的问题 - 没有优化,编译失败,寄存器用完,优化花了将近半个小时。我的内核有这样的表达
t1itern[II(i,j)] = (1.0 - overr) * t1itero[II(i,j)] + overr * (rhs[IJ(i-1,j-1)].rhs1 - abiter[IJ(i-1,j-1)].as * t1itern[II(i,j - 1)] - abiter[IJ(i-1,j-1)].ase * t1itero[II(i + 1,j - 1)] - abiter[IJ(i-1,j-1)].ae * t1itern[II(i + 1,j)] - abiter[IJ(i-1,j-1)].ane * t1itero[II(i + 1,j + 1)] - abiter[IJ(i-1,j-1)].an * t1itern[II(i,j + 1)] - abiter[IJ(i-1,j-1)].anw * t1itero[II(i - 1,j + 1)] - abiter[IJ(i-1,j-1)].aw * t1itern[II(i - 1,j)] - abiter[IJ(i-1,j-1)].asw * t1itero[II(i - 1,j - 1)] - rhs[IJ(i-1,j-1)].aads * t2itern[II(i,j - 1)] - rhs[IJ(i-1,j-1)].aadn * t2itern[II(i,j + 1)] - rhs[IJ(i-1,j-1)].aade * t2itern[II(i + 1,j)] - rhs[IJ(i-1,j-1)].aadw * t2itern[II(i - 1,j)] - rhs[IJ(i-1,j-1)].aadc * t2itero[II(i,j)]) / abiter[IJ(i-1,j-1)].ac;
当我重写它们时:
tt1 = lrhs.rhs1;
tt1 = tt1 - labiter.as * t1itern[II(1,j - 1)];
tt1 = tt1 - labiter.ase * t1itern[II(2,j - 1)];
tt1 = tt1 - labiter.ae * t1itern[II(2,j)];
//etc
它显着减少了编译时间和寄存器使用。