I was just playing around with the AltiVec extension on a POWER6 cluster we have. When I compiled the code below without any optimizations, my speedup was 4, as I was expecting. However, when I compiled it again with the -O3 flag, I obtained a speedup of 60!
I'm wondering if anyone with more experience can provide some insight into how the compiler is rearranging my code to achieve such a speedup. Is the only possible optimization here through assembly and instruction pipelining, or is there something else I am missing that I could include in my future work?
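#include <altivec.h>
#include <iostream>

int main(void) {
    const int m = 1000;
    __vector signed int va;
    __vector signed int vb;
    __vector signed int vc;
    __vector signed int vd;
    // vec_ld and vec_st ignore the low four address bits, so keep the arrays 16-byte aligned.
    int a[m] __attribute__((aligned(16)));
    int b[m] __attribute__((aligned(16)));
    int c[m] __attribute__((aligned(16)));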
    for( int i=0 ; i < m ; i++ ) {
        a[i] = i;
        b[i] = i;
        c[i] = 0;
    }
    for( int cnt = 0 ; cnt < 10000000 ; cnt++ ) {
        vd = (__vector signed int){cnt,cnt,cnt,cnt};
        // Each vector holds four ints, so step by 4 across all m elements.
        for( int i = 0 ; i < m ; i+=4 ) {
            va = vec_ld(0, &a[i]);
            vb = vec_ld(0, &b[i]);
            vc = vec_add(vd, vec_add(va,vb));
            vec_st(vc, 0, &c[i]);
        }
    }
    std::cout << c[0] << ", " << c[1] << ", " << c[2] << ", " << c[3] << "\n";
    return 0;
}
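
For checking that the vector version still computes the right values at each optimization level, a plain scalar reference along these lines might help. This is only a sketch: the function name scalar_reference and its reps parameter are made up for illustration, and it uses std::vector rather than stack arrays so it builds on any machine, not just one with AltiVec. It is not the baseline used for the timings above.

```cpp
#include <vector>

// Hypothetical scalar reference: computes c[i] = cnt + (a[i] + b[i]) for every
// element, repeated `reps` times, which is what the AltiVec loop is meant to do.
// Only the final pass (cnt == reps - 1) determines the contents of c.
std::vector<int> scalar_reference(int m, int reps) {
    std::vector<int> a(m), b(m), c(m, 0);
    for (int i = 0; i < m; i++) {
        a[i] = i;
        b[i] = i;
    }
    for (int cnt = 0; cnt < reps; cnt++) {
        for (int i = 0; i < m; i++) {
            c[i] = cnt + (a[i] + b[i]);
        }
    }
    return c;
}
```

With m = 1000 and reps = 100, the first four elements should come out as 99, 101, 103, 105, since each element ends up as 2*i plus the last value of cnt. Running the vector code with the same repetition count and comparing against these values is a quick way to confirm the results match before looking at timings.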