I keep seeing these palindrome questions, and clunky code from beginners, and it got me interested in writing a version that's actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel (because the loop will be more than 4 fused-domain uops there, due to limited macro-fusion of back-to-back cmp/jcc).
See also the section below on making it case-insensitive.
Checking more than one byte at a time works, even with overlap at the crossover in the middle:
abcdcba
abcd
dcba ; overlap by one
bcdc
cdcb ; overlap by two
Load some bytes and reverse their order with bswap, with movbe (Haswell / Silvermont), with SSSE3 pshufb, or even with rol ax,8 / rol eax,16 / rol ax,8. Then compare against the same number of bytes that should match.
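As a sketch of the same strategy in C (hypothetical, not part of the original asm): load 8-byte chunks from each end, byte-reverse one with the GNU/Clang builtin `__builtin_bswap64` (which compiles to `bswap` or `movbe`), and compare, stopping when the pointers cross. `memcpy` expresses the unaligned loads.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical C version: 8 bytes per iteration from each end.  Chunks may
// overlap in the middle; that's fine because a byte always equals itself.
static bool check_palindrome_c(size_t len, const char *str)
{
    if (len < 8) {                       // byte-at-a-time for short strings
        for (size_t i = 0; i < len / 2; i++)
            if (str[i] != str[len - 1 - i])
                return false;
        return true;
    }
    const char *fwd = str;
    const char *end = str + len;         // one-past-the-end, like rdi in the asm
    do {
        uint64_t front, back;
        end -= 8;
        memcpy(&front, fwd, 8);          // unaligned 8-byte loads
        memcpy(&back, end, 8);
        if (__builtin_bswap64(front) != back)  // like bswap rax / cmp rax,[rdi]
            return false;
        fwd += 8;
    } while (fwd < end);                 // stop when the pointers cross
    return true;
}
```

Compiled with gcc/clang at -O2, the loop body should turn into mov / bswap / cmp much like the hand-written loop.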
Add whatever result-printing you like. I'm just returning an exit status.
All of this works the same in 32-bit, but I used 64-bit because the register calling convention avoids cluttering the code with stack manipulation.
Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.
DEFAULT rel
section .text
ALIGN 16
global check_palindrome
; AMD64 SysV calling convention
check_palindrome: ; (size_t len /*rdi*/, const char *str /*rsi*/)
;;returns bool
cmp rdi, 8
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
add rdi, rsi ; rdi = end pointer
;ALIGN 16 ; probably not worth it, depending on the CPU
.palin_loop:
lodsq ; rax = [rsi], rsi+=8. lodsd/q is only 2 uops on Haswell
bswap rax
sub rdi, 8
cmp rax, [rdi]
jne .not_palin
cmp rsi, rdi
jb .palin_loop ; stop looping when the pointers cross
;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
;; 9 uops on SnB, where lodsq is 3 uops, and only one of cmp/jcc pairs can macro-fuse. For SnB, use mov / add instead of lodsq
;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
;; or with SSE/AVX, 2x16B per clock. Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)
; input was a palindrome
mov eax, 1 ; return true
ret
.not_palin:
xor eax,eax ; return false
ret
ALIGN 16
;; helper function with the same signature as the main version
; only needed for small strings, not for unaligned, or not-multiple-of-8
; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
check_palindrome_byte_at_a_time:
test edi,edi ; or cmp edi, 1 since every 1-char string is also a palindrome
jz .is_palin ; the empty string is a palindrome
;ALIGN 16
.palin_loop:
mov al, [rsi + rdi - 1] ; 2-register addresses can't micro-fuse on SnB-family, but this is a pure load that doesn't need to micro-fuse with anything.
cmp al, [rsi]
jne check_palindrome.not_palin
inc rsi ; pointer moves forward
sub edi, 2 ; index counts down towards zero twice as fast
ja .palin_loop ; treat counter as unsigned. Not jae, because a byte is always equal to itself.
;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock
.is_palin:
mov eax, 1 ; return true
ret
; .not_palin: ; shared code with other check_palindrome version
global main
extern strlen
ALIGN 16
main:
;; return !check_palindrome(strlen(argv[1]), argv[1])
;mov edi, inputstr_len
;mov esi, inputstr
mov rdi, [rsi+8] ; argv[1]
push rdi ; save it, and align the stack
call strlen
mov rdi, rax
mov rsi, [rsp] ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
call check_palindrome
pop rdi ; shorter than add rsp,8 and faster on Intel CPUs with a stack engine, given the surrounding instructions
xor eax, 1 ; we know check_palin returns a _Bool, so flip it
ret
;;; Unused, but improved in case you do want to use it
section .rodata
inputstr db "mommom"
inputstr_len equ $-inputstr
msg_palin db "is palindrome",10,0
msg_nopalin db "is not palindrome" ; newline and terminating zero are on the next line, with their own label
msg_newline db 10,0
(For ideas on how to write main(), look at gcc output on Godbolt. Compiler output is often a good starting point.)
Tested and working:
$ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
$ ./palindrome '' && echo "true" || echo "false"
true
$ ./palindrome 1 && echo "true" || echo false
true
$ ./palindrome abccba && echo "true" || echo "false"
true # even size: pair of middle chars
$ ./palindrome abcdcba && echo "true" || echo "false"
true # odd size: single middle char
$ ./palindrome abcdeba && echo "true" || echo "false"
false
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
false
$ ./palindrome && echo "true" || echo "false"
Segmentation fault (core dumped) # main doesn't check argc
Case-insensitive:
In addition to byte-reversing one chunk, force the alphabetic characters of both chunks to lower case before comparing. With ASCII this is easy. The answer I linked even has an SSE vector implementation of case-flipping, which could be modified to just force all alphabetic characters to the same case (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn / por).
Unrolling:
Besides using displacements ([rsi+8], [rdi-8], etc.) to save add/sub instructions, you can also reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU with no partial-register merging slowdowns at all (e.g. AMD), it could be a win:
.palin_loop:
xor eax,eax ; inside the loop to break the false dependency
xor edx,edx
movbe ... / cmp ... ; movbe is a load and endian swap in one insn. On Haswell it just saves code-size, not uops. On Silvermont it's a win
setne dl
movbe rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
setne dh
shl edx, 16 ; DH merging takes a cycle on Haswell
... repeat setting dl/dh so edx holds 4 results from four 8B checks
movbe ... / cmp ...
setne al
movbe ... / cmp ...
setne ah
shl eax, 16
... repeat setting al/ah
or eax, edx ; or unroll twice as much, and use rax,rdx
jnz .not_palin
With SSE, where pmovmskb already gets the compare result into an integer register without needing a setcc, ORing results together is an even bigger win.
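In C the branch-reduction trick looks something like this (hypothetical sketch): XOR each pair of chunks (zero iff equal), OR the differences together, and branch once per unrolled group.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical 2x-unrolled check; back_end points one past the last byte.
// XOR of equal chunks is zero, so OR-ing the XOR results lets one
// test/branch cover two 8-byte comparisons.
static bool chunks_match_unrolled(const char *fwd, const char *back_end)
{
    uint64_t f0, f1, b0, b1;
    memcpy(&f0, fwd, 8);                 // unaligned loads
    memcpy(&f1, fwd + 8, 8);
    memcpy(&b0, back_end - 8, 8);
    memcpy(&b1, back_end - 16, 8);
    uint64_t diff = (__builtin_bswap64(f0) ^ b0)
                  | (__builtin_bswap64(f1) ^ b1);
    return diff == 0;
}
```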
AVX:
A loop structured like the byte-at-a-time version looks like the way to go, using pshufb to reverse the order of a whole vector. pmovmskb and testing the bitmask is a very standard idiom for handling vector compare results. It's faster than PTEST, because PTEST is 2 uops and can't macro-fuse.
DEFAULT rel
section .rodata
ALIGN 16
bytereverse_shuffle db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
ALIGN 16
global check_palindrome_AVX1
check_palindrome_AVX1:
cmp rdi, 32
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
vmovdqa xmm5, [bytereverse_shuffle]
;;ALIGN 16 ; NASM syntax doesn't have an equivalent for the GNU assembler's align-if-padding-will-be-less-than-N-bytes .p2align 4,,8 (4 meaning 2^4 for power-of-2 align)
;; loop alignment is less important for tiny loops that should fit in the loop buffer
.palin_loop:
vmovdqu xmm0, [rsi + rdi - 16]
vpshufb xmm0, xmm0, xmm5 ; bswap
;;; for AVX2 ymm: vperm2i128 ymm0, ymm0, ymm0, 1 ; swap hi and lo lanes
vpcmpeqb xmm0, xmm0, [rsi] ; all we're gaining from AVX is folding this potentially-unaligned memory operand. Everything else works with SSE
vpmovmskb eax, xmm0
cmp eax, 0xffff ; check that all 16bits are set -> all 16bytes matched
jne .not_palin
add rsi, 16 ; pointer moves forward
sub rdi, 32 ; index counts down towards zero twice as fast
jae .palin_loop ; not ja: we still need to check 16 reversed bytes against the same 16 bytes if we exactly meet in the middle
;;; fused-domain uops: 7 for Haswell, 8 for SnB, so it should run one iteration per 2c
;;; An SSSE3 version that needs an unaligned load separate from pcmpeq will be 8 uops (or 9 on pre-Haswell)
.is_palin:
mov eax, 1
ret
.not_palin:
xor eax,eax
ret