I need help with the performance of the following code. It does a memcpy on two dynamically allocated arrays of arbitrary size:
int main()
{
double *a, *b;
unsigned n = 10000000, i;
a = malloc(n*sizeof(double));
b = malloc(n*sizeof(double));
for(i=0; i<n; i++) {
a[i] = 1.0;
/* b[i] = 0.0; */
}
tic();
bzero(b, n*sizeof(double));
toc("bzero1");
tic();
bzero(b, n*sizeof(double));
toc("bzero2");
tic();
memcpy(b, a, n*sizeof(double));
toc("memcpy");
}
tic/toc measure the execution time.
On my computer it takes 0.035s to memcpy (Linux, gcc version 4.4.6). If I now uncomment the line which initializes the destination array b, the code is three times faster (!) - 0.011s.
I have observed similar behavior when using a loop instead of memcpy. Usually I do not care about this since it is enough to 'initialize' the memory before using it. However, I now need to perform a simple memory copy, and do it as fast as possible. Initializing the data requires writing e.g. 0 to the memory, which is not necessary and takes time. And I would like to perform a memory copy with all available memory bandwidth.
Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and can not be worked around? How is it on other systems?
Edit: The same results are obtained with gcc 4.6. I used -O3 to compile.
Edit: Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show
bzero1 0.273981
bzero2 0.056803
memcpy 0.117934
Clearly, the first bzero call does much more than just stream zeros to memory - that is memory mapping and memory zeroing. The second bzero call on the other hand takes half of the time required to do a memcpy, which is exactly as expected - write only time vs. read and write time. I understand that the overhead of the second bzero call must be there due to OS security reasons. What about the rest? Can I not decrease it somehow, e.g. use larger memory pages? Different kernel settings?
I should mention that I run this on Ubuntu wheeze.