I am learning x86 assembler in order to write a compiler. In particular, I'm taking a variety of simple recursive functions and feeding them through different compilers (OCaml, GCC etc.) in order to get a better understanding of the kinds of assembler generated by different compilers.
I've got a trivial recursive integer Fibonacci function:
int fib(int x) { return (x < 2 ? x : fib(x-1)+fib(x-2)); }
My hand-coded assembly looks like this:
fib:
cmp eax, 2
jl fin
push eax
dec eax
call fib
push eax
mov eax, [esp+4]
add eax, -2
call fib
add eax, [esp]
add esp, 8
fin:
ret
Compiling that function to Intel-syntax assembler using gcc -O2 produces this enigmatic code:
_fib:
push edi
push esi
push ebx
sub esp, 16
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
sub ebx, 2
add esi, eax
cmp ebx, 1
jg L3
and edi, 1
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
L4:
xor esi, esi
jmp L2
So I guess the calling convention is argument at [esp+4] and return value in eax. It starts by pushing edi, esi and ebx. Then it claims another 16 bytes for a stack frame, enough for 4 temporary ints. Then edi is read from [esp+32], which is the argument. If the argument is <=1 then it jumps to L4 which zeroes out (?) esi before jumping back to L2 which sets eax=esi+edi which is just the argument edi. If the argument was >1 then the argument is copied into ebx and zeroes esi before falling through into L3. In L3, it sets eax=ebx-1 and stores the result (n-1) at esp in the stack frame before recursing to calculate fib(n-1). The result is added to esi, ebx is set to n-2 and it loops back to L3 if ebx>1 otherwise it extracts the lower bit of edi before falling through to L2.
Why is this code so convoluted (e.g. is there a name for an optimization that has been done that I'm not seeing?)?
The recursive calls fib(n-2) seem to have been replaced with a loop accumulating in esi but that call wasn't in tail position so how was this done?
What is the purpose of the and edi, 1?
What is the purpose of the mov DWORD PTR [esp], eax?
Why is the stack frame so large?
Can you disassemble this algorithm back into C to make it clearer what is going on?
My preliminary impression is that GCC generates pretty poor x86 assembler. In this case, over 2× more code for equal performance (3.25s for fib(40) on this 1.6GHz Atom for both programs). Is that fair?