I keep seeing these palindrome questions, and clunky code from beginners, and it got me interested in writing a version that's actually efficient. My byte-at-a-time loop should run at one iteration per clock on Intel Haswell, but slower on earlier Intel (because the loop will be more than 4 fused-domain uops there, due to limited macro-fusion of back-to-back cmp/jcc).
See also the section below on making it case-insensitive.
Checking more than one byte at a time works, even with overlap at the crossover in the middle:
abcdcba
abcd
dcba ; overlap by one
bcdc
cdcb ; overlap by two
Load some bytes and reverse their order with bswap, with movbe (Haswell / Silvermont), with SSSE3 pshufb, or even with rol ax,8 / rol eax,16 / rol ax,8. Then compare against the same number of bytes that should match.
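As a sketch of the same strategy in C (hypothetical, not part of the original asm): load 8-byte chunks from each end, byte-reverse one with the GNU/Clang builtin `__builtin_bswap64` (which compiles to `bswap` or `movbe`), and compare, stopping when the pointers cross. `memcpy` expresses the unaligned loads.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical C version: 8 bytes per iteration from each end.  Chunks may
// overlap in the middle; that's fine because a byte always equals itself.
static bool check_palindrome_c(size_t len, const char *str)
{
    if (len < 8) {                       // byte-at-a-time for short strings
        for (size_t i = 0; i < len / 2; i++)
            if (str[i] != str[len - 1 - i])
                return false;
        return true;
    }
    const char *fwd = str;
    const char *end = str + len;         // one-past-the-end, like rdi in the asm
    do {
        uint64_t front, back;
        end -= 8;
        memcpy(&front, fwd, 8);          // unaligned 8-byte loads
        memcpy(&back, end, 8);
        if (__builtin_bswap64(front) != back)  // like bswap rax / cmp rax,[rdi]
            return false;
        fwd += 8;
    } while (fwd < end);                 // stop when the pointers cross
    return true;
}
```

Compiled with gcc/clang at -O2, the loop body should turn into mov / bswap / cmp much like the hand-written loop.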
Add whatever result-printing you like. I'm just returning an exit status.
All of this works the same in 32-bit, but I used 64-bit because the register calling convention avoids cluttering the code with stack manipulation.
Note the use of local labels (.label) to avoid conflicts between similar blocks of code that use the same label names.
DEFAULT rel
section .text
ALIGN 16
global check_palindrome
; AMD64 SysV calling convention
check_palindrome: ; (size_t len /*rdi*/, const char *str /*rsi*/)
;;returns bool
cmp rdi, 8
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
add rdi, rsi ; rdi = end pointer
;ALIGN 16 ; probably not worth it, depending on the CPU
.palin_loop:
lodsq ; rax = [rsi], rsi+=8. lodsd/q is only 2 uops on Haswell
bswap rax
sub rdi, 8
cmp rax, [rdi]
jne .not_palin
cmp rsi, rdi
jb .palin_loop ; stop looping when the pointers cross
;; Loop has 7 uops on Haswell, so it can sustain one iteration (8 bytes in each direction) per 2 cycles.
;; 9 uops on SnB, where lodsq is 3 uops, and only one of cmp/jcc pairs can macro-fuse. For SnB, use mov / add instead of lodsq
;; with unrolling, we might get this down closer to 2x8B per clock, instead of per 2 clocks.
;; or with SSE/AVX, 2x16B per clock. Probably not 2x32B per clock with AVX, due to needing two shuffles. (no cross-lane byte shuffle until AVX512)
; input was a palindrome
mov eax, 1 ; return true
ret
.not_palin:
xor eax,eax ; return false
ret
ALIGN 16
;; helper function with the same signature as the main version
; only needed for small strings, not for unaligned, or not-multiple-of-8
; assume that rdi < 2^32 so we can use edi interchangeably, for smaller code-size.
; If our prototype was (unsigned int, const char*), we'd have to mov ecx, edi or something.
; (moving to a different reg is preferable to mov edi,edi because mov-elimination never works on mov same,same)
check_palindrome_byte_at_a_time:
test edi,edi ; or cmp edi, 1 since every 1-char string is also a palindrome
jz .is_palin ; the empty string is a palindrome
;ALIGN 16
.palin_loop:
mov al, [rsi + rdi - 1] ; 2-register addresses can't micro-fuse on SnB-family, but this is a pure load that doesn't need to micro-fuse with anything.
cmp al, [rsi]
jne check_palindrome.not_palin
inc rsi ; pointer moves forward
sub edi, 2 ; index counts down towards zero twice as fast
ja .palin_loop ; treat counter as unsigned. Not jae, because a byte is always equal to itself.
;;; Haswell and later can fuse both branches even when they hit the decoders in the same cycle
;;; so this loop should hopefully be 4 fused-domain uops and run at one iteration per clock
.is_palin:
mov eax, 1 ; return true
ret
; .not_palin: ; shared code with other check_palindrome version
global main
extern strlen
ALIGN 16
main:
;; return !check_palindrome(strlen(argv[1]), argv[1])
;mov edi, inputstr_len
;mov esi, inputstr
mov rdi, [rsi+8] ; argv[1]
push rdi ; save it, and align the stack
call strlen
mov rdi, rax
mov rsi, [rsp] ; could pop here and take advantage of the fact that we know check_palindrome doesn't care about the stack
call check_palindrome
pop rdi ; shorter than add rsp,8 and faster on Intel CPUs with a stack engine, given the surrounding instructions
xor eax, 1 ; we know check_palin returns a _Bool, so flip it
ret
;;; Unused, but improved in case you do want to use it
section .rodata
inputstr db "mommom"
inputstr_len equ $-inputstr
msg_palin db "is palindrome",10,0
msg_nopalin db "is not palindrome" ; newline and terminating zero are on the next line, with their own label
msg_newline db 10,0
(For ideas on how to write main(), look at gcc output on Godbolt. Compiler output is often a good starting point.)
Tested and working:
$ yasm -Worphan-labels -gdwarf2 -felf64 palindrome.asm && gcc palindrome.o -o palindrome
$ ./palindrome '' && echo "true" || echo "false"
true
$ ./palindrome 1 && echo "true" || echo false
true
$ ./palindrome abccba && echo "true" || echo "false"
true # even size: pair of middle chars
$ ./palindrome abcdcba && echo "true" || echo "false"
true # odd size: single middle char
$ ./palindrome abcdeba && echo "true" || echo "false"
false
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb bab bcdcb 4321 bcdcb ba' && echo "true" || echo "false"
true
$ ./palindrome 'ab bcdcb 1234 bcdcb baab bcdcb 4321 bcdcb baa' && echo "true" || echo "false"
false
$ ./palindrome && echo "true" || echo "false"
Segmentation fault (core dumped) # main doesn't check argc
Case-insensitive:
In addition to byte-reversing one chunk, force the alphabetic characters of both chunks to lower case before comparing. With ASCII this is easy. The answer I linked even has an SSE vector implementation of case-flipping, which could be modified to just force all alphabetic characters to the same case (either replace the final pxor with por, or use a blend with the compare result as the control, instead of pandn / por).
Unrolling:
Besides using displacements ([rsi+8], [rdi-8], etc.) to save add/sub instructions, you can also reduce the number of branches by ORing or ANDing some compare results together. This may or may not save anything. On a CPU with no partial-register merging slowdowns at all (e.g. AMD), it could be a win:
.palin_loop:
xor eax,eax ; inside the loop to break the false dependency
xor edx,edx
movbe ... / cmp ... ; movbe is a load and endian swap in one insn. On Haswell it just saves code-size, not uops. On Silvermont it's a win
setne dl
movbe rcx, [rsi+rdi-16] / cmp [rsi+8], rcx
setne dh
shl edx, 16 ; DH merging takes a cycle on Haswell
... repeat setting dl/dh so edx holds 4 results from four 8B checks
movbe ... / cmp ...
setne al
movbe ... / cmp ...
setne ah
shl eax, 16
... repeat setting al/ah
or eax, edx ; or unroll twice as much, and use rax,rdx
jnz .not_palin
With SSE, where pmovmskb already gets the compare result into an integer register without needing a setcc, ORing results together is an even bigger win.
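In C the branch-reduction trick looks something like this (hypothetical sketch): XOR each pair of chunks (zero iff equal), OR the differences together, and branch once per unrolled group.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical 2x-unrolled check; back_end points one past the last byte.
// XOR of equal chunks is zero, so OR-ing the XOR results lets one
// test/branch cover two 8-byte comparisons.
static bool chunks_match_unrolled(const char *fwd, const char *back_end)
{
    uint64_t f0, f1, b0, b1;
    memcpy(&f0, fwd, 8);                 // unaligned loads
    memcpy(&f1, fwd + 8, 8);
    memcpy(&b0, back_end - 8, 8);
    memcpy(&b1, back_end - 16, 8);
    uint64_t diff = (__builtin_bswap64(f0) ^ b0)
                  | (__builtin_bswap64(f1) ^ b1);
    return diff == 0;
}
```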
AVX:
A loop structured like the byte-at-a-time version looks like the way to go, using pshufb to reverse the order of a whole vector. pmovmskb and testing the bitmask is a very standard idiom for handling vector compare results. It's faster than PTEST, because PTEST is 2 uops and can't macro-fuse.
DEFAULT rel
section .rodata
ALIGN 16
bytereverse_shuffle db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
section .text
ALIGN 16
global check_palindrome_AVX1
check_palindrome_AVX1:
cmp rdi, 32
jb check_palindrome_byte_at_a_time ; tailcall the version that handles small inputs
vmovdqa xmm5, [bytereverse_shuffle]
;;ALIGN 16 ; NASM syntax doesn't have an equivalent for the GNU assembler's align-if-padding-will-be-less-than-N-bytes .p2align 4,,8 (4 meaning 2^4 for power-of-2 align)
;; loop alignment is less important for tiny loops that should fit in the loop buffer
.palin_loop:
vmovdqu xmm0, [rsi + rdi - 16]
vpshufb xmm0, xmm0, xmm5 ; bswap
;;; for AVX2 ymm: vperm2i128 ymm0, ymm0, ymm0, 1 ; swap hi and lo lanes
vpcmpeqb xmm0, xmm0, [rsi] ; all we're gaining from AVX is folding this potentially-unaligned memory operand. Everything else works with SSE
vpmovmskb eax, xmm0
cmp eax, 0xffff ; check that all 16bits are set -> all 16bytes matched
jne .not_palin
add rsi, 16 ; pointer moves forward
sub rdi, 32 ; index counts down towards zero twice as fast
jae .palin_loop ; not ja: we still need to check 16 reversed bytes against the same 16 bytes if we exactly meet in the middle
;;; fused-domain uops: 7 for Haswell, 8 for SnB, so it should run one iteration per 2c
;;; An SSSE3 version that needs an unaligned load separate from pcmpeq will be 8 uops (or 9 on pre-Haswell)
.is_palin:
mov eax, 1
ret
.not_palin:
xor eax,eax
ret