assembly - 装配中的 FizzBuzz - 分段错误

Question

我正在尝试在 Assembly 中编写 FizzBuzz，但我一直看到分段错误。到目前为止，我已经确定这不是我的打印例程（因为我已经删除了它们的内容并且问题仍然存在）并且错误隐藏在主函数的某处。

当我运行程序时，我得到了这个输出：

fizzSegmentation fault

让我相信使用除法和查找余数不是问题。但我可能是错的，我已经两年没做过组装了......

SECTION .data
global _start
    fizz: db "fizz", 4
    buzz: db "buzz", 4

SECTION .bss
    counter: resb    1

SECTION .text
_start:

    mov ax,0
    mov [counter],ax

main_loop:

    cmp ax,100          ;from 0 to 100
    je  exit            ;
    mov bl,3            ;divisor
    mov ah,0            ;here will be a remainder
    div bl              ;divide
    cmp ah,0            ;compare the remainder with 0
    je  print_fizz      ;print fizz if they equal
    mov bl,5            ;new divisor
    mov ah,0            ;do I have to do it every time?
    div bl              ;divide
    cmp ah,0            ;compare the remainder with 0
    je  print_buzz      ;print buzz if they equal
    jmp print_ax        ;print contents of ax if not
    inc ax              ;increment ax
    jmp main_loop       ;jump to label

print_ax:
    ret

print_fizz:
    ret

print_buzz:
    ret

exit:
    mov rax,1
    mov rbx,0
    int 80h
    ret

我正在编译使用：

yasm -f elf64 -o fizzbuzz.o fizzbuzz.asm
ld -d -o fizzbuzz fizzbuzz.o

score 4 · Accepted Answer

这个答案比我计划的要长得多，有点像编写高效 asm 的教程。即如何使一个简单的问题复杂化。

如果有人对尝试的实现的代码审查感兴趣，并且有很多 asm 技巧的版本：

有很多小方法可以做得更好，例如 keep 5inbh和3in bl。您不必总是使用div bl. AMD64 有 20 个单字节寄存器。（al/ah、bl/bh、cl/ch、dl/dh（无 REX）和 sil、dil、... r15b（需要 REX））。

使用 16 位计数器至少会浪费字节（操作数大小的前缀），并且会导致速度变慢。使用mov reg,0不好。尽可能将条件分支放在循环的底部。

mov rax, 1与相比mov eax, 1，这是浪费指令字节，并且标记为yasm，它不会在汇编时为您优化。（nasm本身就是。）设置 64 位寄存器，然后使用int 0x8032 位兼容性 ABI 更加愚蠢。

首先将 16 位计数器存储到内存中是愚蠢的，但是将其存储到只保留一个字节的地址是自找麻烦。

除了小东西， FizzBuzz(3,5)小到足以展开并div完全避免一些 s。使用汇编器宏，您可以轻松地生成一个完全展开的循环，每个循环具有 LCM(fizz,buzz) 输出（在这种情况下为 15 个）；足以让模式重复，所以你不需要任何条件。

div您可以通过使用向下计数器检测count%5==0和来避免count%3==0展开而不展开。 @anatolyg 的 16 位 DOS 代码高尔夫 FizzBuzz 做到了。这是一种非常常见的技术，每 N 次发生其他事情就做某事。例如，性能计数器事件以这种方式工作。

这是我对高效 FizzBuzz（用于 AMD64 Linux）的尝试，不使用任何库。只有`write(2)`和`exit_group(2)`

没有编译器，所以如果你想要好的代码，你必须自己编写。您不能只希望编译器会在循环中做一些好事（对于大多数编译器i%3来说它无论如何都不会）。

在我编写代码时，代码发生了很大变化。像往常一样，当你看到你的第一个想法需要比你希望的更多或更慢的指令时，开始实施一种方法会给你更好的想法。

我展开 3 (Fizz) 以删除所有对counter%3. 我counter%5用从 5 开始的倒计时而不是除法来处理支票。这仍然需要相当多的逻辑，完全展开到模式重复的点（LCM（3,5））就会消失。整数到 ASCII 数字的代码可以在一个函数中，或者内联到展开的循环中以获得非常臃肿的代码。

我将所有内容都保存在寄存器中（包括常量fizz\n和buzz\n）。没有负载，只存储到缓冲区中。许多寄存器在循环外设置一次，而不是mov在使用前使用 -immediate。这需要良好的评论来跟踪你把什么放在哪里。

write(2)我将字符附加到我们在每一fizzbuzz\n行之后的缓冲区。这是程序逻辑中自然发生的最长循环，意味着我们只需要syscall在一个地方编写代码。

在可能写入文件或管道的实际程序中，最好使用 C stdio 的策略，即在这种情况下使用更大的缓冲区。（许多~100 字节的写入比更少的 4096B 写入要糟糕得多。）不过，我认为这是在传统 printf 每次迭代或将整个字符串累积到一个大缓冲区之间的一个有趣的选择。我使用静态缓冲区而不是保留堆栈空间，因为我正在编写整个程序，而不是应该避免在返回后浪费内存的函数。此外，它让我可以使用 32 位操作数大小进行指针增量，以节省代码字节（REX 前缀）。

累积多个块非常容易，直到您到达下一组可能超过缓冲区末尾的点。即比较当前位置与buffer_end - BUZZMOD*FIZZMOD*9。优化 I/O 系统调用显然是一个广泛的话题，这个版本足以演示在缓冲区中累积字符串。

;  for (count=1..100):
;  if(count%3 == 0) { print_fizz(); }
;  if(count%5 == 0) { print_buzz(); } else {
;       if(count%3 && count%5) print(count);
;; }
;  print(newline)

; We don't need pointers to these strings at all;  The strings are immediate data for a couple mov instructions
;SECTION .rodata        ; put constants in .rodata.
;    fizz: db "fizz"    ; No idea what the trailing  4  was for
;    buzz: db "buzz"

FIZZMOD equ 3                   ; only 3 works, but it would be easy to use a loop
BUZZMOD equ 5                   ; any value works
LASTCOUNT equ 100    ; max 100: we only handle two decimal digits.
; TODO: cleanup that can handle LASTCOUNT%FIZZMOD != 1 and LASTCOUNT%BUZZMOD != 0


SECTION .bss
;;; generate a string in this buffer.  (flush it with write(2) on "fizzbuzz" lines)
;    buf: resb    4096
buf: resb    FIZZMOD * BUZZMOD * 9     ; (worst case: every line is "fizzbuzz\n")

SECTION .text
global _start
_start:

    ; args for write(2).  (syscall clobbers rcx/r11,  and rax with the return value)
    mov   edi, 1                ; STDOUT_FILENO.  also happens to be __NR_write in the AMD64 Linux ABI
    mov   esi, buf              ; static data lives in the low 2G of address space, so we don't need a 64bit mov
    ;; edx = count.             ; calculated each iteration
    ;; mov eax, edi             ; also needed every time.   saves 3B vs  mov eax, imm32

    ; 'fizz' is only used once, so we could just store with an immediate there.  That wouldn't micro-fuse, and we'd have to do the newline separately
    mov   r10b, 10      ; base 10
    ;;mov   r14d, BUZZMOD  ; not needed, we don't div for this
    mov   r12, 'fizz' | 10<<32      ; `fizz\n`, but YASM doesn't support NASM's backquotes for \-escapes
    mov   r13, 'buzz' | 10<<32      ; `buzz\n`.  When buzz appears, it's always the end of a line


;;;;;;;; Set up for first iteration
    mov   ebp, BUZZMOD          ; detect count%BUZZMOD == 0 with a down-counter instead of dividing
    mov   ebx, 1                ; counter starts at 1
    mov   edx, esi              ; current output position = front of buf
ALIGN 16
main_loop:

    ;; TODO: loop FIZZMOD-1 times inside buzz_or_number, or here
    ;; It doesn't make much sense to unroll this loop but not inline buzz_or_number :/
    call  buzz_or_number
    inc   ebx

    call  buzz_or_number
    add   ebx, 2                ; counter is never printed on Fizz iterations, so just set up for next main_loop

    ;; Fizz, and maybe also Buzz
    mov   qword [rdx], r12      ; Fizz with a newline
    add   edx, 5                ; TODO: move this after the branch; adjust the offsets in .fizzbuzz

    dec   ebp
    jz   .fizzbuzz

;;.done_buzz:   ; .fizzbuzz duplicates the main_loop branch instead of jumping back here
    cmp   ebx, LASTCOUNT-FIZZMOD
    jbe   main_loop
;;;;;;;;;; END OF main_loop


.cleanup:
;;;;;;;;;;;;;;;;;;;;;  Cleanup after the loop
    ; hard-code the fact that 100 % FIZZMOD = 1 more line to print,
    ; and that 100 % BUZZMOD = 0, so the line is "buzz\n"

    mov   eax, edi              ; __NR_write
    mov   [rdx], r13            ; the final "buzz\n".
    sub   edx, buf - 5          ; write_count = current_pos+5 - buf.
    syscall                     ; write(1, buf, p - buf).
    ;; if buf isn't static, then use  add   edx, 5 / sub   edx, esi

    xor edi, edi
    mov eax, 231    ;  exit_group(0).  same as eax=60: exit() for a single-threaded program
    syscall


;;;;; The fizzbuzz case from the loop
.fizzbuzz:
;; count%BUZZMOD == 0:   rdx points after the \n at the end of fizz\n, which we need to overwrite

;; this is a macro so we can use it in buzz_or_number, too, where we don't need to back up and overwrite a \n
%macro  BUZZ_HIT 1
    mov   [rdx - %1], r13       ; buzz\n.  Next line will overwrite the last 3 bytes of the 64b store.
    add   edx, 5 - %1
    mov   ebp, BUZZMOD          ; reset the count%BUZZMOD down-counter
%endmacro

    BUZZ_HIT 1                  ; arg=1 to back up and overwrite the \n from "fizz\n"

    sub   edx, esi              ; write_count = current_pos - buf
    mov   eax, edi              ; __NR_write
    syscall                     ; write(1, buf, p - buf).  clobbers only rax (return value), and rcx,r11
    mov   edx, esi              ; restart at the front of the buffer

;;; tail-duplication of the main loop, instead of jmp back to the cmp/jbe
;;; could just be a jmp main_loop, if we check at assemble time that  LASTCOUNT % FIZZMOD != 0 || LASTCOUNT % BUZZMOD != 0
    cmp   ebx, LASTCOUNT-FIZZMOD
    jbe   main_loop
    jmp   .cleanup

;;;;;;;;;;;;;;;;;;;;;;; buzz_or_number: called for non-fizz cases
; special calling convention: uses (without clobbering) the same regs as the loop
;; modifies: BUZZMOD down-counter, output position pointer
;; clobbers: rax, rcx
ALIGN 32
buzz_or_number:
    dec   ebp
    jnz  .no_buzz              ; could make this part of the macro, but flow-control inside macros is probably worse than duplication

;; count%BUZZMOD == 0:  append "buzz\n" to the buffer and reset the down-counter
    BUZZ_HIT  0                 ; back up 0 bytes before appending
    ret

.no_buzz:             ;; get count as a 1 or 2-digit ASCII number
    ;; assert(ebx < 10);   We don't handle 3-digit numbers

    mov   eax, ebx
    div   r10b                  ; al = count/10 (first (high) decimal digit), ah = count%10 (second (low) decimal digit).
    ;; x86 is little-endian, so this is in printing-order already for storing eax

    ;movzx eax, ax            ; avoid partial-reg stalls on pre-Haswell
    ;; convert integer digits to ASCII by adding '0' to al and ah at the same time, and set the 3rd byte to `\n`.
    cmp   ebx, 9                ; compare against the original counter instead of the div result, for more ILP and earlier detection of branch misprediction
    jbe   .1digit               ; most numbers from 1..100 are 2-digit, so make this the not-taken case
    add   eax, 0x0a3030   ;;  `00\n`: converts 2 integer digits -> ASCII
    ;; eax now holds the number + newline as a 3-byte ASCII string
    mov   [rdx], eax
    add   edx, 3
    ret

.1digit:
;; Could use a 16bit operand-size here to avoid partial-reg stalls, but an imm16 would LCP-stall on Intel.
    shr   eax, 8                ; Shift out the leading 0
    add   eax, 0x000a30   ;; 1-digit numbers
    ;; eax now holds the number + newline as a 2-byte ASCII string
    mov   [rdx], ax
    add   edx, 2
    ret

它是这样运行的：

$ strace ./fizzbuzz > /dev/null
execve("./fizzbuzz", ["./fizzbuzz"], [/* 69 vars */]) = 0
write(1, "1\n2\nfizz\n4\nbuzz\nfizz\n7\n8\nfizz\nbu"..., 58) = 58
write(1, "16\n17\nfizz\n19\nbuzz\nfizz\n22\n23\nfi"..., 63) = 63
write(1, "31\n32\nfizz\n34\nbuzz\nfizz\n37\n38\nfi"..., 63) = 63
write(1, "46\n47\nfizz\n49\nbuzz\nfizz\n52\n53\nfi"..., 63) = 63
write(1, "61\n62\nfizz\n64\nbuzz\nfizz\n67\n68\nfi"..., 63) = 63
write(1, "76\n77\nfizz\n79\nbuzz\nfizz\n82\n83\nfi"..., 63) = 63
write(1, "91\n92\nfizz\n94\nbuzz\nfizz\n97\n98\nfi"..., 40) = 40
exit_group(0)                           = ?

正确性检查：

./fizzbuzz | diff - <(perl -E'say((fizz)[$_%3].(buzz)[$_%5]or$_)for+1..100')
# no output = no difference

展开 Buzz (5) 并为 Fizz 使用向下计数器可能会更糟。我的版本有一个 64 位存储，fizz\n\0\0\0然后有一个分支来决定是否存储buzz\n\0\0\0重叠生成fizzbuzz\n. 另一种方式是有一个分支来决定是否存储fizz（不需要换行符，所以它可以是一个 32 位存储）。然后它将无条件存储buzz\n\0\0\0. 然而，由于 FIZZMOD 比 BUZZMOD 小，这意味着递减计数器的重置更频繁，并且需要更多检查以查看我们是否需要在本次迭代中打印数字而不是字符串。每三行已知fizz\n或fizzbuzz\n意味着更简单的代码更频繁地运行。

如果重叠存储是一个问题，那么整个算法就搞砸了，这只是众多算法之一。此外，我们可以在存储和添加 5 之前进行分支fizz\n。然后在这种fizzbuzz\n情况下，我们进行两次存储并添加 9。这也将 dec/jcc 与底部的 cmp/jcc 分开main_loop，因此它们有望同时进行宏熔断在前哈斯韦尔。IIRC，一些 CPU 有分支预测器，它们真的不喜欢多个非常接近的分支。

进一步的改进，留给读者练习：

内联buzz_or_number，并可能将其变成一个循环（FIZZMOD-1 迭代）
除此之外，它可能分支更少，以及其他小的改进。这是一个版本 1.1：工作，测试，在编写这个答案时添加了一些评论和观察，但实际上并没有对我最初认为的代码进行太多改进，以查看它是否有效。
通过为最后几行编写清理循环（或汇编器宏）来使其更灵活LASTCOUNT % FIZZMOD，而不是假设它是 1 行。清理代码是展开的缺点。
我使用 a divby 10将计数器转换为字符串。更好的实现将使用乘法逆，例如编译器为小的常量除数生成（在这种情况下使用 LEA 实现）。

另一个值得考虑的策略是减少增加一系列 ASCII 数字（存储在寄存器中）的强度。这种技术将更难扩展到具有更多数字的数字。按打印顺序存储它们（低字节中的最高有效数字）会使数字之间的进位对我们不利而不是对我们不利。（例如，如果它们按自然顺序排列，您可以add eax, 256-10通过进位更正低位并增加高位。）保持这种方式可能是值得的，但要存储 BSWAP。将一个\n嵌入在寄存器中以便只需要一个商店可能不值得。检测和处理 1 位数字变成 2 位数字已经够糟糕的了。

在 32 位模式下，我们可以使用AAA指令在递增后进行十进制进位。但是，尽管有助记符，但它适用于 BCD ( 0-9)，而不是 ASCII ( '0'-'9')，并且似乎不容易将进位传播到第 3 位。难怪 AMD 为 AMD64 删除了它。它检查AF标志以检测低 4 位的执行，但这仅适用于DAA，您将两个 BCD 数字打包到一个字节中，并且当您添加未知值而不是递增时。在这种情况下，您只需检查al >= 10.

我的第一个版本几乎第一次就成功了（在修复了几个语法错误以便它可以组装之后，以及一个需要几分钟调试 IIRC 的愚蠢崩溃之后）：它打印fizz\nbuzz\n在fizzbuzz\n案例中，并且它颠倒了数字。我一直忘记数字字符串需要首先存储最重要的数字，而不是小端二进制整数中的字节。

做事的替代方法

我决定不使用1 位与 2 位 ASCII 转换代码的无分支版本，因为它需要大量指令。此外，分支应该很好地预测。

  ;; Untested
  buzz_or_number:
  ...
  .no_buzz:
    ... 
    div   r10b
 DECIMAL_TO_ASCII_NEWLINE_2DIGIT equ 0x0a3030   ; add '0' to two unpacked decimal digits, and a newline
 DECIMAL_TO_ASCII_NEWLINE_1DIGIT equ 0x000a30

    ;; hoist this out of the loop: mov   r15d, DECIMAL_TO_ASCII_NEWLINE_2DIGIT - DECIMAL_TO_ASCII_NEWLINE_1DIGIT
    xor   ecx,ecx
    cmp   ah, 1            ; set CF if ah=0 (1 digit number), otherwise clear it.  This allows sbb for a conditional add, instead of setcc
    cmovae ecx, r15d       ; 0 or the difference from 1digit to 2digit

    lea   eax, [rax+rcx + DECIMAL_TO_ASCII_NEWLINE_1DIGIT]  ; rax+=0x0a3030 or 0x000a30, without clobbering flags
    mov   [rdx], eax
    sbb   edx, -3          ; add 2 (-(-3) - 1) or 3.
    ret

在 32 位（和 16 位）模式下，有一条div指令采用立即操作数，并AL用作被除数，而不是AX. 它被称为AAM，并与其他 BCD/ASCII 指令一起为 AMD64 删除。它可以方便地测试被 5 整除，而无需为除数占用寄存器或在循环内浪费一条指令。它比略快div r/m8，并根据余数设置标志（在al：与相比，它的输出反转div）。

Anatolyg 打高尔夫球的 FizzBuzz在循环中使用 AAM，shr ax, 8以相反的顺序一次生成一个数字，存储和递减指针。

这个版本要复杂得多，因为它用于AAM检查 count%5，然后将其处理为 count%10，而不是进行单独的除法来获取 ASCII 数字。

 ;; Untested
buzz_or_number_div:
    mov   eax, ebx

    aam   5               ; al = al%5  ah = al/5.  (opposite locations from div), and sets flags according to the remainder.
    jz    print_buzz      ; tailcall


    ; fall through into print_counter
;print_counter:
    ; maybe use the result of div by 5 to get division by 10?  
    ; shifting the low bit of the quotient into bit 4 of the remainder should be faster than dividing again.
    ;; after AAM: ah = 5bit quotient (qqqqQ), al = 3bit remainder(RRR)
    ;; starting point:     ; AX = [ 000qqqqQ 00000RRR ]
    ;; desired = byte swapped as well: [ 0000QRRR 0000qqqq ]
    shl   al, 5            ; AX = [ 000qqqqQ RRR00000 ]
    shr   ax, 1            ; AX = [ 0000qqqq QRRR0000 ]
    ror   ax, 8            ; AX = [ QRRR0000 0000qqqq ]  ; simple byte-swap
    shr   ah, 4            ; AX = [ 0000QRRR 0000qqqq ]

    add   eax, ...;  convert to ascii
    ...
    ret

    ; those instructions are all single-uop 1c latency on SnB-family, but pre-Haswell will insert extra merging uops.  (And stall while doing so, on pre-SnB).
    ; and there's another partial-reg stall when we read eax

    ; It might be possible to do this bit manipulation with fewer operations, or maybe different ones.  (maybe copy ax to cx, so we can move from cl or ch to al or ah?)

;    shr   ah, 1           ; AX = [ 0000qqqq 00000RRR ]  CF=Q   ; then what? setc/shift/or?  rcl is slow, too.
;    rorx  eax, eax, 32-4  ; AX = [ qqqq0000 0RRR0000 ]  CF=Q
;  nope, seems a dead end

;    shl   ah, 3           ; AX = [ qqqqQ000 00000RRR ]
;    ror   ax, 7           ; AX = [ 0000RRRq qqqQ0000 ]
;    shr   al, 4           ; AX = [ 0000RRRq 0000qqqQ ]
;  oops, no, shifts the wrong way.

;    shl   ah, 3           ; AX = [ qqqqQ000 00000RRR ]
;    or    ah, al          ; AX = [ qqqqQRRR 00000RRR ]
;    xor   al,al           ; AX = [ qqqqQRRR 00000000 ]
;    rol   ax, 4           ; AX = [ QRRR0000 0000qqqq ]
;    shr   ah, 4           ; AX = [ QRRR0000 qqqq0000 ]
; only 3 shifts, but still partial-reg heavy.  Interesting on Haswell

;    ror   ax, 9           ; AX = [ Q00000RR R000qqqq ]  CF=Q

score 2 · Accepted Answer

这导致了分段错误：

...
    je  print_fizz      ;print fizz if they equal
...
    je  print_buzz      ;print buzz if they equal
    jmp print_ax        ;print contents of ax if not
...

print_ax:
    ret

print_fizz:
    ret

print_buzz:
    ret
...

由于您跳转到函数，因此ret没有返回地址，并且会在任何地方返回。将其更改为call/ret-pair：

...
;   je  print_fizz      ;print fizz if they equal
    jne .1              ;skip if not equal
    call print_fizz
    .1:
...

;   je  print_buzz      ;print buzz if they equal
    jne .2              ;skip if not equal
    call print_buzz
    .2:

;   jmp print_ax        ;print contents of ax if not
    call print_ax
...

这将导致无限循环：

mov ax,0
mov [counter],ax

main_loop:

    cmp ax,100          ;from 0 to 100
    je  exit
    ...
    mov ah,0            ;here will be a remainder
    div bl              ;divide
    ...
    mov ah,0            ;do I have to do it every time?
    div bl              ;divide
    ...
    inc ax              ;increment ax
    jmp main_loop       ;jump to label

AX更改其值并且不适合保存循环计数器。我建议：

...
main_loop:

;   cmp ax,100          ;from 0 to 100
    cmp byte [counter], 100
...
;   inc ax              ;increment ax
    inc byte [counter]
    jmp main_loop       ;jump to label
...

score 1 · Accepted Answer

使用调试器单步执行您的代码，看看哪里出错了。

一眼就能看出你正在破坏ax（也许你不知道它ax是由和组成的ah？al）。此外，您正在跳转到函数而不是调用它们，这可能是故障的原因。

assembly - 装配中的 FizzBu​​zz - 分段错误

3 回答 3

这是我对高效 FizzBu​​zz（用于 AMD64 Linux）的尝试，不使用任何库。只有write(2)和exit_group(2)

进一步的改进，留给读者练习：

做事的替代方法

Related

Reference

assembly - 装配中的 FizzBuzz - 分段错误

这是我对高效 FizzBuzz（用于 AMD64 Linux）的尝试，不使用任何库。只有`write(2)`和`exit_group(2)`