While tuning a frequency analysis function that uses the Accelerate framework, the absolute system time has consistently been 225 ms per iteration. Then last night I changed the order in which two of the arrays were declared, and suddenly it went down to 202 ms. A 10% improvement from just changing the declaration order seems insane. Can someone explain to me why the compiler (which is set to optimize) is not already finding this solution?
Additional info: Before the loop there is some setup (16 lines long) of the arrays used in the loop, which consists of converting them from integer to float arrays (for Accelerate) and then taking the sin and cos of the time array. All of the float arrays (8 arrays x 1000 elements) are declared first in the function (after a sanity check of the parameters). They are always declared with the same size (set by a constant), because sizing them to the actual length hurt performance while shrinking the footprint only slightly. I tested making them globals, but I think the compiler had already figured that out, as there was no performance change. The loop is 25 lines long.
---Additions---
Yes, "-Os" is the flag. (default in Xcode anyways: Fastest, Smallest)
(below is from memory - don't try to compile it, because I didn't put in things like stride (which is 1), etc. However, all of the Accelerate calls are there)
passed parameters: inttimearray, intamparray, length, scale1, scale2, amp
float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize]; //these two make the most change
float timearray[maxsize]; //these two make the most change
vDSP_vfltu32(inttimearray,timearray,length); //convert to float array
vDSP_vflt16(intamparray,amparray,length); //convert to float array
vDSP_vsmul(timearray,scale1,temparray,length); //scale time and store in temp
vvcosf(temparray,trigarray3,length); //cos of temparray
vvsinf(temparray,trigarray4,length); //sin of temparray
vDSP_vneg(trigarray4,trigarray5,length); //negative of trigarray4
vDSP_vsmul(timearray,scale2,temparray,length); //scale time and store in temp
vvcosf(temparray,trigarray1,length); //cos of temparray
vvsinf(temparray,trigarray2,length); //sin of temparray
float ysum;
vDSP_sve(amparray,ysum,length); //sum of amparray
float csum, ssum, ccsum, sssum, cssum, ycsum, yssum;
for (i = 0; i<max; i++) {
vDSP_sve(trigarray1,csum,length); //sum of trigarray1
vDSP_sve(trigarray2,ssum,length); //sum of trigarray2
vDSP_svesq(trigarray1,ccsum,length); //sum of trigarray1^2
vDSP_svesq(trigarray2,sssum,length); //sum of trigarray2^2
vDSP_vmul(trigarray1,trigarray2,temparray,length); //temp = trig1*trig2
vDSP_sve(temparray,cssum,length); //sum of temp array
// 2 more sets of the above 2 lines, for the 2 remaining sums
amp[i] = (arithmetic of sums);
//trig identity to increase the sin/cos by a delta frequency
//vmma is a*b+c*d=result
vDSP_vmma (trigarray1,trigarray3,trigarray2,trigarray4,temparray,length);
vDSP_vmma (trigarray2,trigarray3,trigarray1,trigarray5,trigarray2,length);
memcpy(trigarray1,temparray,length*sizeof(float));
}
---Current Solution---
I've made some changes as follows:
The arrays are all declared aligned and zeroed out (I'll explain next), and maxsize is now a multiple of 16.
__attribute__ ((aligned (16))) float timearray[maxsize] = {0};
I've zeroed out all of the arrays because now, when the length is less than maxsize, I round the length up to the nearest multiple of 16. That way all of the looped functions operate on widths divisible by 16, and the zero padding does not affect the sums.
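The padding idea above can be sketched in plain C as follows. This is only an illustration under assumptions: MAXSIZE (1008 here), round_up_to_block, sum_floats and padded_sum are hypothetical names, and the scalar sum loop stands in for vDSP_sve.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16      /* chunk width the lengths are rounded up to */
#define MAXSIZE 1008  /* hypothetical buffer size, a multiple of BLOCK */

/* Round n up to the next multiple of BLOCK (241 -> 256, 256 -> 256). */
size_t round_up_to_block(size_t n)
{
    return (n + BLOCK - 1) / BLOCK * BLOCK;
}

/* Scalar stand-in for a vector sum over n elements. */
float sum_floats(const float *x, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}

/* Copy `length` values into a zeroed, 16-byte aligned stack buffer and
   sum over the padded length. The extra elements are all zero, so the
   result equals the sum of the first `length` values. */
float padded_sum(const float *src, size_t length)
{
    assert(src != NULL && length <= MAXSIZE);
    __attribute__((aligned(16))) float buf[MAXSIZE] = {0};
    for (size_t i = 0; i < length; i++) {
        buf[i] = src[i];
    }
    return sum_floats(buf, round_up_to_block(length));
}
```

Because the buffer lives on the stack, each call gets its own copy, which matters when the function can be called by more than one object at a time.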
The benefits are:
- Slight performance boost
- The speed is nearly constant regardless of order of array declaration (which is now done right before they are needed, instead of all in a big block)
- The speed is also nearly constant for any 16-wide length (i.e. 241 to 256, or 225 to 240...), whereas before, if the length went from 256 to 255, the function would take a 3+% performance hit.
In the future (possibly with this code, as analysis requirements are still in flux), I realize I'll need to pay more attention to stack usage and to the alignment and chunking of vectors. Unfortunately, for this code, I can't make these arrays static or global, as this function can be called by more than one object at a time.