c++ - Loop optimization by the IBM xlC compiler with Altivec

Question

I was just playing around with the Altivec extension on a Power6 cluster we have. I noticed that when I compiled the code below without any optimizations, my speedup was 4x, as I was expecting. However, when I compiled it again with the -O3 flag, I obtained a speedup of 60x!

I'm wondering if anyone has more experience with this and can provide some insight into how the compiler is rearranging my code to achieve such a speedup. Is the only possible optimization here through assembly and instruction pipelining, or is there something else I am missing that I can apply in my future work?

#include <altivec.h>   // vec_ld, vec_add, vec_st
#include <iostream>    // std::cout

int main(void) {
        const int m = 1000;

        __vector signed int va;
        __vector signed int vb;
        __vector signed int vc;
        __vector signed int vd;

        int a[m];
        int b[m];
        int c[m];

        for( int i=0 ; i < m ; i++ ) {
                a[i] = i;
                b[i] = i;
                c[i] = 0;
        }

        for( int cnt = 0 ; cnt < 10000000 ; cnt++ ) {
                vd = (__vector signed int){cnt,cnt,cnt,cnt};

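                // Note: i < m/4 with i += 4 gives 63 iterations (i = 0, 4, ..., 248),
                // so only elements 0..251 are processed.
                for( int i = 0 ; i < m/4 ; i+=4 ) {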
                        va = vec_ld(0, &a[i]);
                        vb = vec_ld(0, &b[i]);
                        vc = vec_add(vd, vec_add(va,vb));
                        vec_st(vc, 0, &c[i]);
                }
        }

        std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";

        return 0;
}

score 4 · Accepted Answer

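I've done some work on Power7 and have seen some very strange things from the XLC compiler. But nothing as strange as this! (At least not 60x...)

One thing to note about the PowerPC line (at least Power6 and Power7) is that instruction latencies are very long and out-of-order execution is very weak compared to x86/x64.

So the inner loop, as written in your code, will get extremely low IPC.

Now, the only way I can imagine you getting a 60x speedup is if the inner loop is completely unrolled under -O3. That is possible because the trip count of the inner loop can be statically determined to be 63.

Unrolling that inner loop would basically fill the entire pipeline.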
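To make the unrolling idea concrete, here is a minimal scalar sketch in portable C++. It is not what xlC actually emits. The function names, the unroll factor of 4, and the use of `std::vector` are assumptions made for illustration. It also checks the trip count of 63 at compile time.

```cpp
#include <cstddef>
#include <vector>

// Number of iterations of `for (i = 0; i < bound; i += step)`.
constexpr int trip_count(int bound, int step) {
    return (bound + step - 1) / step;
}

// The question's inner loop: bound m/4 = 250, step 4.
static_assert(trip_count(1000 / 4, 4) == 63, "inner loop runs 63 times");

// Rolled form: one load/load/add/add/store chain per iteration.
// Assumes a and b are at least as long as c.
void add_rolled(const std::vector<int>& a, const std::vector<int>& b,
                int d, std::vector<int>& c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = d + (a[i] + b[i]);
    }
}

// Hand-unrolled by 4 (illustrative factor). The four statements in the
// body are independent of each other, so a pipeline with long latencies
// can overlap them instead of waiting on a single dependency chain.
void add_unrolled4(const std::vector<int>& a, const std::vector<int>& b,
                   int d, std::vector<int>& c) {
    const std::size_t n = c.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = d + (a[i]     + b[i]);
        c[i + 1] = d + (a[i + 1] + b[i + 1]);
        c[i + 2] = d + (a[i + 2] + b[i + 2]);
        c[i + 3] = d + (a[i + 3] + b[i + 3]);
    }
    for (; i < n; ++i) {  // remainder when n is not a multiple of 4
        c[i] = d + (a[i] + b[i]);
    }
}
```

Both functions produce the same output. The difference is only in how much independent work each loop iteration exposes, which is the effect a compiler-generated unroll would be after.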

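Of course, I'm only guessing. The best thing to do is look at the assembly.

Also, how are you timing this? A lot of the strange behavior I've seen on PowerPC comes from the timers themselves...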
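On the timing question, one way to rule out the timer is to measure with a monotonic clock and make sure the result of the timed work is actually used. The sketch below is an assumption about how one might do that in standard C++, not the asker's method. `elapsed_seconds` and `sum_to` are hypothetical names.

```cpp
#include <chrono>
#include <cstdint>

// Runs work() once and returns the elapsed wall-clock time in seconds,
// measured with std::chrono::steady_clock (monotonic, never adjusted).
template <typename Work>
double elapsed_seconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical stand-in workload: sums 0 .. n-1. The caller should keep
// and use the return value so the optimizer cannot discard the loop.
std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}
```

A call might look like `double t = elapsed_seconds([&] { result = sum_to(n); });`, followed by printing both `t` and `result`. Repeating the measurement several times and comparing the runs helps separate timer noise from a real speedup.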

EDIT:

Since your example code is fairly simple, it should be easy to spot (in the assembly) whether that inner loop is partially or fully unrolled.
