不,loop
/loopne
每个只有一种编码,使用 SHORT(有符号 8 位rel8
)相对位移。
x86 自 386 以来具有jcc rel32
aka NEAR 条件分支以及jcc rel8
. 如果你需要跳得更远rel8
,loop
对于小代码大小是错误的选择。 dec ecx
/jnz
基本上是等效的,除了它破坏了一些标志。
代码大小是您应该使用的唯一原因之一loop
;除了最近的 AMD 之外,一切都很慢)。(速度 ~= 8086 上的代码大小,所以它在那里很有用。后来的 CPU 故意让它变慢是为了与驱动程序中的(坏?)延迟循环兼容。
如果你真的想用 跳得更远loop
,你可以这样做:
loop .stay_in_loop
jmp .leave_loop
.stay_in_loop:
jmp .far_away
.leave_loop:
该.stay_in_loop
块可以在函数结束之后,或者在 127B 内从未到达的其他地方,这可以让你省略jmp .leave_loop
.
你也可以“作弊”,把一些循环体放到外线。例如,您可以将Print
块移动到函数结束之后(或开始之前),然后jcc
移动到它并jmp
返回到循环内部。
使用两个嵌套loop
指令只是脑残,但似乎你需要使用它。push/pop 来保存/恢复外部循环ecx
不是必需的;您可以使用另一个寄存器mov ecx, esi
或其他东西。无论哪种方式都很糟糕,因为推动或mov esi, ecx
必须处于循环的顶部。除非您dec esi
自己伪造它,而不是复制loop
. 这使代码更清晰,但可能被认为是作弊/违背了要求的目的。:P
也许你的任务的真正目标是让你优化代码大小,以便短跳转可以达到. 在您的情况下,这绝对是整体代码质量的正确方法。
初学者代码效率低下是正常的,因此这里有许多简单的局部优化。egxor edx,edx
比 5 byte 多 2 字节(并且更快)mov edx, 0
。用于test edx,edx
与零进行比较。 它更小更好,并设置相同的标志cmp edx,0
此外,将变量保存在寄存器中。类似mov writeCounter, 0
汇编的指令mov r/m32, imm32
,使用 disp32 寻址模式。所以这是 8 个字节的地址和数据,加上 2 个字节的操作码 + ModR/M。每个绝对地址而不是寄存器操作数都需要 4 个字节。
如果您需要在循环内保存更多代码大小,您可以将内容复制到堆栈并使用[ebp+disp8]
寻址模式访问它们(ModR/M 之外的 1 个字节而不是 4 个)。
要释放更多寄存器以在循环中使用,请在函数的开始/结束处推送/弹出它们。
问题中代码的清理版本
外部的位移loop
是0xc9
, 或 -55。所以我的版本不到你版本的一半。ResetWriteCounter
(循环外有 12 个字节的代码。)
push ebx
push ebp ; we're NOT using it as a frame pointer
push edi
push esi
;;; Important to keep in registers
; edi = inCompare = trial divisor
; ebx = outCompare = prime/composite candidate
;; nice to keep in registers
; ebp = WriteCounter (but done as a down-counter with dec instead of inc/cmp)
; esi = outer loop counter
;; these variable names are kinda bad, IMO.
;; candidate / divisor would be better names.
; ecx = inner loop counter
; eax,edx = scratch for DIV
mov ebx, outCompare ; first composite candidate
mov ebp, 10 ; down-counter to trigger newlines
push ' ' ; two spaces followed by 2 zero bytes (i.e. null-terminated string)
; Use mov edx, esp instead of mov edx, OFFSET spaces in the print block
;;; note that code before this isn't part of the size that LOOP has to jump over.
outerLoop:
test bl, 1
jz Print ;; multiple of 2
;; inner loop now only has to check the odd divisors
mov edi, 3 ; first trial divisor = first odd prime (other than 1)
mov ecx, innerLoopCount ; ideally set to sqrt(ebx), e.g. cvtsi2ss xmm0, ebx / sqrtss xmm0,xmm0 / cvtss2si xmm0, ecx might be correct :P
; could instead use ecx = ebx directly if you don't care about speed, so try divisors all the way up to the candidate.
;; inner loop
isComposite:
mov eax, ebx ; edx:eax = dividend = candidate
xor edx, edx
div edi ; trial division
add edi, 2 ; next trial divisor = next odd number
test edx, edx ; equivalent to cmp edx,0 to check remainder
jz Print ; you could maybe fold this into a LOOPNZ and check ZF after the loop
;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
;;cmp edi, ebx ; divisor, candidate
;;jae exitInnerLoop ; what's the point of this? isn't ecx set so that loop falls through after this many iterations?
loop isComposite ;Retest with new divisor if there is a possibility
jmp NoPrint ; else it was prime
;;; I put this outside the inner loop in the first place instead of needing to multiple jumps out of this to exitInnerLoop
Print: ;Print out a composite value if its been proven
mov eax, ebx
call WriteDec
dec ebp ; new line when the down-counter reaches zero
jz ResetWriteCounter
; print spaces only if not printing a newline
mov edx, esp ; OFFSET spaces; Use stack memory we pushed earlier
call WriteString ;Write spaces between the composites
;; tail of outer loop. Prime numbers jump here without printing.
exitInnerLoop:
NoPrint:
inc ebx ;Next number to check for being composite
; TODO: increment by 2 and don't even check the even numbers, just print them after every odd prime/composite
mov ecx, esi ; ecx = outer loop counter for the LOOP insn
dec esi ; update outer counter instead of having mov esi, ecx at the top of the loop
loop outerLoop
add esp, 4 ; clear the ' ' we pushed earlier. pop esi is smaller and about as efficiently
pop esi
pop edi
pop ebp
pop ebx
ret
; after the end of the function is a good place for infrequently-use blocks that you jump to and then jump back.
ResetWriteCounter:
mov ebp, 10 ; reset counter
; start a new line after writing 10 composites
call CrLf
; could have done this branchlessly by selecting a string of ' ',0 or 13,10,0 based on ebp being zero
; but doing the ebp update branchlessly too is worse than this highly predictable branch. Maybe if the count was a power of 2, DEC/AND
jmp NoPrint ; back where we came from, skipping the write spaces
; alternative: duplicate inc / mov / dec / loop block here
在分支结构方面,您可以做的更多。例如,我想消除jmp NoPrint
掉出循环后的情况。这似乎很笨拙。将Print
块放在其他地方就可以了,但是 Print 将不得不跳而不是跌倒。
TODO:ecx
用作试验除数。不过,倒数可能比倒数慢得多,因为小的素数更常见。也需要避免1
。
我打算只应用一些本地优化,但是当我完成对您的分支结构的理解时,我已经很开心地在代码大小上打高尔夫球了。高尔夫代码中的高尔夫 = 最少的字节获胜。
请参阅我的一些x86 机器代码高尔夫答案以了解代码大小的技巧(包括以牺牲速度为代价,有时甚至很多):