I am using gettimeofday() to time a simple matrix-multiply example, but the initial timings come out close to twice as long as they should be. On a RHEL6 Server machine, I get "bad" timing results for up to nearly 1 second (about the first 65 individual timings in this example). All our other machines are RHEL5 Workstation boxes, and the code behaves much better there: I only get a couple of "bad" results at the start (roughly the first 20 milliseconds).
From posts on this site, I think this probably has something to do with the OS process scheduler. If I uncomment the first "for" statement below (thereby inserting an initial busy loop that repeatedly re-initializes the matrices a, b and c), I get zero "bad" results under both RHEL5 Workstation and RHEL6 Server. Alternatively, if I uncomment the sleep statement, I get ALL "bad" timing results on both RHEL5 and RHEL6.
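To be concrete, the kind of warm-up I'm describing looks roughly like this (just a sketch for illustration; it is not in the program below, which instead repeats the initialization loop DELAY_LOOPS times when uncommented):

#include <sys/time.h>

/* Hypothetical warm-up helper: busy-spin for roughly `seconds` of wall
 * time before any timed work, so measurements start with the process
 * already "hot". Not part of my actual test program. */
static void warm_up(double seconds)
{
    struct timeval start, now;
    volatile double sink = 0.0;   /* volatile so the busy work isn't optimized away */

    gettimeofday(&start, NULL);
    do {
        sink += 1.0;              /* trivial busy work */
        gettimeofday(&now, NULL);
    } while ((now.tv_sec - start.tv_sec)
             + 1e-6 * (now.tv_usec - start.tv_usec) < seconds);
    (void)sink;
}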
For some reason, my process seems to start out with only about half of its full CPU throughput; it then gets "full" access to the CPU as long as it stays busy. If it sleeps and then resumes timing, it again temporarily gets only about half the full access to the CPU.
Nothing else is happening on the machine (X is not running). I have tried "chrt" and "nice" to control the priority of the process, but neither changed anything. I've verified the behavior occurs with both GCC 4.4.6 and ICC 12.1.0.
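In case it's relevant, here is roughly the in-process equivalent of what I was attempting with chrt and nice (a sketch only; my actual runs used the command-line tools, and SCHED_FIFO requires root):

#include <sched.h>
#include <sys/resource.h>
#include <stdio.h>

/* Sketch of the in-process equivalents of `chrt` and `nice`,
 * just to show what I was trying to control. */
static void bump_priority(void)
{
    struct sched_param sp;

    sp.sched_priority = 50;                           /* mid-range real-time priority */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)  /* roughly what `chrt -f 50` does */
        perror("sched_setscheduler");

    if (setpriority(PRIO_PROCESS, 0, -20) != 0)       /* roughly what `nice -n -20` does */
        perror("setpriority");
}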
Here's the code:
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define N 225
#define DELAY_LOOPS 8000

int main(void) {
    struct timeval _t0, _t1, _t2;
    double a[N][N], b[N][N], c[N][N];
    double millisec, cum_ms;
    int i, j, k, l, m = 0;

    gettimeofday( &_t0, NULL );

    /* Uncommenting the outer loop below inserts the initial busy period. */
    // for( l=0; l<DELAY_LOOPS; l++ )
    for( i=0; i<N; i++ )
        for( j=0; j<N; j++ ) {
            a[i][j] = 0;
            b[i][j] = i;
            c[i][j] = j;
        }

    for( l=0; l<75; l++ ) {
        gettimeofday( &_t1, NULL );
        for( i=0; i<N; i++ )
            for( j=0; j<N; j++ )
                for( k=0; k<N; k++ )
                    a[i][j] += b[i][k]*c[k][j];
        gettimeofday( &_t2, NULL );

        /* Per-iteration and cumulative elapsed time, in milliseconds. */
        millisec  = 1000*(_t2.tv_sec - _t1.tv_sec);
        millisec += 1e-3*(_t2.tv_usec - _t1.tv_usec);
        cum_ms    = 1000*(_t2.tv_sec - _t0.tv_sec);
        cum_ms   += 1e-3*(_t2.tv_usec - _t0.tv_usec);

        printf( "%d: duration %fms, cumulative %fms\n",
                m++, millisec, cum_ms );
        // sleep( 2 );
    }

    printf( "a[%d][%d]=%f\n", N/2, N/2, a[N/2][N/2] );
    return 0;
}
and here are the results:
% icc -O2 -o test main.c; ./test
0: duration 13.049000ms, cumulative 13.677000ms
1: duration 13.026000ms, cumulative 26.753000ms
2: duration 12.911000ms, cumulative 39.668000ms
3: duration 12.913000ms, cumulative 52.584000ms
4: duration 12.914000ms, cumulative 65.501000ms
5: duration 12.911000ms, cumulative 78.415000ms
6: duration 12.912000ms, cumulative 91.331000ms
/* snip */
64: duration 12.912000ms, cumulative 840.633000ms
65: duration 10.455000ms, cumulative 851.092000ms
66: duration 5.910000ms, cumulative 857.004000ms
67: duration 5.908000ms, cumulative 862.914000ms
68: duration 5.907000ms, cumulative 868.823000ms
69: duration 5.908000ms, cumulative 874.732000ms
70: duration 5.912000ms, cumulative 880.646000ms
71: duration 5.907000ms, cumulative 886.554000ms
72: duration 5.907000ms, cumulative 892.462000ms
73: duration 5.908000ms, cumulative 898.372000ms
74: duration 5.908000ms, cumulative 904.281000ms
a[112][112]=211680000.000000
I experience the problem regardless of optimization level (-O0, -O1, -O2, etc.).
Does anyone know how scheduling is done under RHEL6 Server? Is it really so different from RHEL5 Workstation? I presume the difference I'm seeing is more a function of one box being a Server edition of RHEL and the other a Workstation edition, rather than the difference between release 5 and release 6. Is there some simple way to reduce this effect under RHEL6 Server and make it behave more like the RHEL5 Workstation boxes?
Any ideas? Thanks.