I was just playing around with the AltiVec extension on a POWER6 cluster we have. I noticed that when I compiled the code below without any optimizations, my speedup was 4, as I was expecting. However, when I compiled it again with the -O3 flag, I obtained a speedup of 60!

I'm wondering whether anyone with more experience can provide some insight into how the compiler is rearranging my code to achieve such a speedup. Is the only possible optimization here through assembly and instruction pipelining, or is there something else I am missing that I could use in my future work?

#include <iostream>
#include <altivec.h>

int main(void) {
        const int m = 1000;

        __vector signed int va;
        __vector signed int vb;
        __vector signed int vc;
        __vector signed int vd;

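        // vec_ld/vec_st ignore the low 4 address bits, so keep the arrays 16-byte aligned.
        int a[m] __attribute__((aligned(16)));
        int b[m] __attribute__((aligned(16)));
        int c[m] __attribute__((aligned(16)));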

        for( int i=0 ; i < m ; i++ ) {
                a[i] = i;
                b[i] = i;
                c[i] = 0;
        }

        for( int cnt = 0 ; cnt < 10000000 ; cnt++ ) {
                vd = (__vector signed int){cnt,cnt,cnt,cnt};

                // Note: with i < m/4 and i += 4 this runs 63 times and only touches elements 0..251.
                for( int i = 0 ; i < m/4 ; i+=4 ) {
                        va = vec_ld(0, &a[i]);
                        vb = vec_ld(0, &b[i]);
                        vc = vec_add(vd, vec_add(va,vb));
                        vec_st(vc, 0, &c[i]);
                }
        }

        std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";

        return 0;
}

1 Answer

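I've done some work on Power 7 and have seen some very strange things from the XLC compiler. But nothing this strange! (At least not 60x...)

One thing to note about the PowerPC family (at least Power6 and Power7) is that, compared to x86/x64, instruction latencies are very long and out-of-order execution is very weak.

So the inner loop, as written in your code, will get extremely low IPC (instructions per cycle).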

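Now, the only way I can imagine you getting a 60x speedup is if the inner loop is being fully unrolled under -O3. That is possible because the trip count of the inner loop can be determined statically as 63 (i runs from 0 up to m/4 = 250 in steps of 4).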
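To make the 63 concrete, here is a small sketch of the arithmetic. The helper `trip_count` is a hypothetical name introduced only for illustration; it is not part of the question's code or of any compiler.

```cpp
// Number of iterations of `for (int i = 0; i < limit; i += step)`.
// Hypothetical helper for illustration; assumes step > 0.
constexpr int trip_count(int limit, int step) {
    return limit <= 0 ? 0 : (limit + step - 1) / step;
}

// The question's inner loop: i < m/4 with m = 1000, stepping by 4.
static_assert(trip_count(1000 / 4, 4) == 63, "inner loop runs 63 times");

// What the loop would run if the bound were m instead of m/4.
static_assert(trip_count(1000, 4) == 250, "250 vector operations cover all of m");
```

Because both the bound and the step are compile-time constants, a compiler can work out the same number and decide whether full unrolling is worth it.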

Unrolling that inner loop would basically fill the entire pipeline.
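As a rough illustration of what unrolling changes, here is a portable scalar sketch (plain C++, not AltiVec, so it can be built anywhere). The function names `add_rolled` and `add_unrolled4` are made up for this example, and the factor of 4 is an arbitrary choice; it only shows the shape of the transformation, not what the compiler actually emitted for the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form: one load/add/store chain per iteration, then the loop branch.
void add_rolled(const std::vector<int>& a, const std::vector<int>& b,
                std::vector<int>& c, int k) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = k + (a[i] + b[i]);
    }
}

// Unrolled by 4: the four chains in the body do not depend on each other,
// so their latencies can overlap instead of being paid one after another.
void add_unrolled4(const std::vector<int>& a, const std::vector<int>& b,
                   std::vector<int>& c, int k) {
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const int s0 = a[i]     + b[i];
        const int s1 = a[i + 1] + b[i + 1];
        const int s2 = a[i + 2] + b[i + 2];
        const int s3 = a[i + 3] + b[i + 3];
        c[i]     = k + s0;
        c[i + 1] = k + s1;
        c[i + 2] = k + s2;
        c[i + 3] = k + s3;
    }
    // Remainder loop for sizes that are not a multiple of 4.
    for (; i < n; ++i) {
        c[i] = k + (a[i] + b[i]);
    }
}
```

Both functions produce the same output; the unrolled one simply gives the hardware more independent work between branches, which matters most on a core with long latencies and little reordering.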

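Of course, I'm only guessing. The best thing to do is look at the assembly.

Also, how are you timing this? A lot of the odd behavior I've seen on PowerPC came from the timers themselves...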
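For reference, one way to take the timer out of the list of suspects is a monotonic clock plus a result that is actually used. This is only a sketch assuming a C++11 compiler; `sum_to`, `Timed`, and `time_sum_to` are hypothetical stand-ins, not anything from the question.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the kernel being measured.
std::int64_t sum_to(std::int64_t n) {
    std::int64_t s = 0;
    for (std::int64_t i = 0; i < n; ++i) {
        s += i;
    }
    return s;
}

struct Timed {
    std::int64_t result;  // returned so the computed value is actually used
    double seconds;       // elapsed wall-clock time
};

// Times one call with steady_clock, which is monotonic and therefore
// not affected by system clock adjustments.
Timed time_sum_to(std::int64_t n) {
    const auto t0 = std::chrono::steady_clock::now();
    const std::int64_t r = sum_to(n);
    const auto t1 = std::chrono::steady_clock::now();
    return Timed{r, std::chrono::duration<double>(t1 - t0).count()};
}
```

Even with this, an optimizer may simplify a trivial kernel, so the timing is only meaningful alongside a look at the generated assembly.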

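Edit:

Since your example code is fairly simple, it should be easy to spot (in the assembly) whether that inner loop has been partially or fully unrolled.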

answered 2011-09-16T03:20:49.777