5

I am learning x86 assembler in order to write a compiler. In particular, I'm taking a variety of simple recursive functions and feeding them through different compilers (OCaml, GCC etc.) in order to get a better understanding of the kinds of assembler generated by different compilers.

I've got a trivial recursive integer Fibonacci function:

int fib(int x) { return (x < 2 ? x : fib(x-1)+fib(x-2)); }

My hand-coded assembly looks like this:

fib:
    cmp eax, 2
    jl  fin
    push    eax
    dec eax
    call    fib
    push    eax
    mov eax, [esp+4]
    add eax, -2
    call    fib
    add eax, [esp]
    add esp, 8
fin:
    ret

Compiling that function to Intel-syntax assembler using gcc -O2 produces this enigmatic code:

_fib:
    push    edi
    push    esi
    push    ebx
    sub esp, 16
    mov edi, DWORD PTR [esp+32]
    cmp edi, 1
    jle L4
    mov ebx, edi
    xor esi, esi
L3:
    lea eax, [ebx-1]
    mov DWORD PTR [esp], eax
    call    _fib
    sub ebx, 2
    add esi, eax
    cmp ebx, 1
    jg  L3
    and edi, 1
L2:
    lea eax, [esi+edi]
    add esp, 16
    pop ebx
    pop esi
    pop edi
    ret
L4:
    xor esi, esi
    jmp L2

So I guess the calling convention is argument at [esp+4] and return value in eax. It starts by pushing edi, esi and ebx. Then it claims another 16 bytes for a stack frame, enough for 4 temporary ints. Then edi is read from [esp+32], which is the argument. If the argument is <=1 then it jumps to L4 which zeroes out (?) esi before jumping back to L2 which sets eax=esi+edi which is just the argument edi. If the argument was >1 then the argument is copied into ebx and zeroes esi before falling through into L3. In L3, it sets eax=ebx-1 and stores the result (n-1) at esp in the stack frame before recursing to calculate fib(n-1). The result is added to esi, ebx is set to n-2 and it loops back to L3 if ebx>1 otherwise it extracts the lower bit of edi before falling through to L2.

Why is this code so convoluted (e.g. is there a name for an optimization that has been done that I'm not seeing?)?

The recursive calls fib(n-2) seem to have been replaced with a loop accumulating in esi but that call wasn't in tail position so how was this done?

What is the purpose of the and edi, 1?

What is the purpose of the mov DWORD PTR [esp], eax?

Why is the stack frame so large?

Can you disassemble this algorithm back into C to make it clearer what is going on?

My preliminary impression is that GCC generates pretty poor x86 assembler. In this case, over 2× more code for equal performance (3.25s for fib(40) on this 1.6GHz Atom for both programs). Is that fair?

4

3 回答 3

8

除了上面的注释之外,请注意,通过替换,递归展开为尾调用:

return x < 2 ? x : fib(x - 2) + fib(x - 1);

和:

if ((xprime = x) < 2) {
    acc = 0;
} else {
    /* at this point we know x >= 2 */
    acc = 0; /* start with 0 */
    while (x > 1) {
       acc += fib(x - 1); /* add fib(x-1) */
       x -= 2; /* now we'll add fib(x-2) */
    }
    /* so at this point we know either x==1 or x==0 */
    xprime = x == 1 ? 1 : 0; /* ie, x & 1 */
}
return xprime + acc;

我怀疑这个相当棘手的循环是由多个优化步骤引起的,并不是说我从 gcc 2.3 开始就一直在摆弄 gcc 优化(现在内部完全不同了!)。

于 2012-04-07T22:07:43.380 回答
2

很简单,fib(x-2)is equal to fib(x-3) + fib(x-4)fib(x-4)is equal to fib(x-5) + fib(x-6)etc. 所以fib(x)计算为fib(x-1) + fib(x-3) + fib(x-5) + ... + fib(x&1)( fib(x&1)equals x&1) 即 gcc 已经内联了对 的调用fib(x-2),这对递归函数来说是一件非常聪明的事情。

于 2012-04-07T22:14:19.170 回答
2

第一部分是确保根据调用约定应保留的寄存器不会被丢弃。我猜这里使用的调用约定是cdecl.

_fib:
    push    edi
    push    esi
    push    ebx
    sub esp, 16

DWORD PTR[esp+32]是你的x

    mov edi, DWORD PTR [esp+32]
    cmp edi, 1
    jle L4

如果x小于或等于 1(这对应于您的x < 2),则转到L4哪个是:

L4:
    xor esi, esi
    jmp L2

这将归零esi并分支到L2

L2:
    lea eax, [esi+edi]
    add esp, 16
    pop ebx
    pop esi
    pop edi
    ret

这将eax(返回值)设置为esi+edi. 由于esi已经是 0,edi所以在 0 和 1 的情况下才加载。这对应于x < 2 ? x.

现在我们看看xis not的情况< 2

    mov ebx, edi
    xor esi, esi
L3:
    lea eax, [ebx-1]
    mov DWORD PTR [esp], eax
    call    _fib

首先,x复制到ebx,然后esi归零。

接下来,eax设置为x - 1。这个值被移到栈顶并被_fib调用。这对应于fib(x-1)

    sub ebx, 2
    add esi, eax

ebx这从( x) 中减去 2 。然后(来自call 的eax返回值被添加到之前设置为 0 的 )。因此现在持有 的结果。_fibesiesifib(x-1)

    cmp ebx, 1
    jg  L3
    and edi, 1

ebx与 1 进行比较。如果大于 1,则循环回L3. 否则(它是 0 或 1 的情况),我们执行and edi, 1并落入L2(我们之前已经分析过它的作用)。and edi, 1相当于执行%2on x

从高层次来看,这就是代码的作用:

  • 设置堆栈帧并保存寄存器
  • 如果x < 2,则返回x
  • 继续调用和求和fib(x-...),直到x小于 2。在这种情况下,通过x < 2案例。

优化之处在于 GCCx >= 2通过循环而不是递归来展开这些情况。

于 2012-04-07T22:15:29.357 回答