I am learning x86 assembler in order to write a compiler. In particular, I'm taking a variety of simple recursive functions and feeding them through different compilers (OCaml, GCC etc.) in order to get a better understanding of the kinds of assembler generated by different compilers.
I've got a trivial recursive integer Fibonacci function:
int fib(int x) { return (x < 2 ? x : fib(x-1)+fib(x-2)); }
My hand-coded assembly looks like this:
fib:
cmp eax, 2
jl fin
push eax
dec eax
call fib
push eax
mov eax, [esp+4]
add eax, -2
call fib
add eax, [esp]
add esp, 8
fin:
ret
Compiling that function to Intel-syntax assembler using gcc -O2
produces this enigmatic code:
_fib:
push edi
push esi
push ebx
sub esp, 16
mov edi, DWORD PTR [esp+32]
cmp edi, 1
jle L4
mov ebx, edi
xor esi, esi
L3:
lea eax, [ebx-1]
mov DWORD PTR [esp], eax
call _fib
sub ebx, 2
add esi, eax
cmp ebx, 1
jg L3
and edi, 1
L2:
lea eax, [esi+edi]
add esp, 16
pop ebx
pop esi
pop edi
ret
L4:
xor esi, esi
jmp L2
So I guess the calling convention is argument at [esp+4]
and return value in eax
. It starts by pushing edi
, esi
and ebx
. Then it claims another 16 bytes for a stack frame, enough for 4 temporary ints. Then edi
is read from [esp+32]
, which is the argument. If the argument is <=1
then it jumps to L4
which zeroes out (?) esi
before jumping back to L2
which sets eax=esi+edi
which is just the argument edi
. If the argument was >1
then the argument is copied into ebx
and zeroes esi
before falling through into L3
. In L3
, it sets eax=ebx-1
and stores the result (n-1) at esp
in the stack frame before recursing to calculate fib(n-1)
. The result is added to esi
, ebx
is set to n-2
and it loops back to L3
if ebx>1
otherwise it extracts the lower bit of edi
before falling through to L2
.
Why is this code so convoluted (e.g. is there a name for an optimization that has been done that I'm not seeing?)?
The recursive calls fib(n-2)
seem to have been replaced with a loop accumulating in esi
but that call wasn't in tail position so how was this done?
What is the purpose of the and edi, 1
?
What is the purpose of the mov DWORD PTR [esp], eax
?
Why is the stack frame so large?
Can you disassemble this algorithm back into C to make it clearer what is going on?
My preliminary impression is that GCC generates pretty poor x86 assembler. In this case, over 2× more code for equal performance (3.25s for fib(40) on this 1.6GHz Atom for both programs). Is that fair?