loops - 当必须使用循环命令时，无论如何要增加循环命令的范围

Question

outerLoop:
    skipCarry:
        mov     eax, 2
        mov     inCompare, eax   ;Set our inCompare to 2
        mov     eax, outCompare 
        push    ecx ;Save status of these registers before entering the innerLoop
        push    eax
        mov     ecx, innerLoopCount

    isComposite:
        mov     eax, outCompare ;Value to be checked for composite
        mov     edx, 0  ;Make sure edx does not have a left over value
        div     inCompare   ;Divide outCompare by inCompare to see if we get a remainder of 0 in edx
        cmp     edx, 0
        jne     skipPrint   ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
        ;Print out a composite value if its been proven
        mov     eax, outCompare
        call    WriteDec
        mov     edx, OFFSET spaces
        call    WriteString     ;Write spaces between the composites
        mov     ebx, writeCounter
        inc     ebx     ;This will help to make sure we start a new line after writing 10 composites
        mov     writeCounter, ebx
        cmp     ebx, 10     ;If counter is at 10, we start a new line
        jne     exitInnerLoop   ;exit the inner loop because we found that the number in question is composite
        call    CrLf
        mov     writeCounter, 0
        jmp     exitInnerLoop       ;exit the inner loop because we found that the number in question is composite

        skipPrint:      ;Jumped to if the number in question was not confirmed to be composite

        ;This is to continue testing for composite of the number in question by incrementing the divisor
        mov     ebx, inCompare
        inc     ebx
        mov     eax, outCompare
        cmp     eax, ebx
        je      exitInnerLoop
        mov     inCompare, ebx

        skipIncrement:  ;Will be jumped to if there are no new divisors to be tested

        loop    isComposite     ;Retest with new divisior if there is a possibility
        exitInnerLoop:

    pop     eax     ;restore outer loop status
    pop     ecx
    inc     eax     ;Next number to check for being composite
    mov     outCompare, eax
    loop    outerLoop

我收到错误代码，说跳转太长了 5 个字节，指的是外循环。我知道只使用 jmp 命令并自己跟踪计数器会更简单，但我的家庭作业要求使用循环来代替。有没有办法扩展循环的范围，或者有没有办法将循环中的内容缩小 5 个字节，而我没有看到？

循环应该检查一个数字是否是复合的，只是为了知道功能。

score 0 · Accepted Answer

不，loop/loopne每个只有一种编码，使用 SHORT（有符号 8 位rel8）相对位移。

x86 自 386 以来具有jcc rel32aka NEAR 条件分支以及jcc rel8. 如果你需要跳得更远rel8，loop对于小代码大小是错误的选择。 dec ecx/jnz基本上是等效的，除了它破坏了一些标志。

代码大小是您应该使用的唯一原因之一loop；除了最近的 AMD 之外，一切都很慢）。（速度 ~= 8086 上的代码大小，所以它在那里很有用。后来的 CPU 故意让它变慢是为了与驱动程序中的（坏？）延迟循环兼容。

如果你真的想用跳得更远loop，你可以这样做：

    loop  .stay_in_loop

    jmp   .leave_loop
.stay_in_loop:
    jmp .far_away
.leave_loop:

该.stay_in_loop块可以在函数结束之后，或者在 127B 内从未到达的其他地方，这可以让你省略jmp .leave_loop.

你也可以“作弊”，把一些循环体放到外线。例如，您可以将Print块移动到函数结束之后（或开始之前），然后jcc移动到它并jmp返回到循环内部。

使用两个嵌套loop指令只是脑残，但似乎你需要使用它。push/pop 来保存/恢复外部循环ecx不是必需的；您可以使用另一个寄存器mov ecx, esi或其他东西。无论哪种方式都很糟糕，因为推动或mov esi, ecx必须处于循环的顶部。除非您dec esi自己伪造它，而不是复制loop. 这使代码更清晰，但可能被认为是作弊/违背了要求的目的。:P

也许你的任务的真正目标是让你优化代码大小，以便短跳转可以达到. 在您的情况下，这绝对是整体代码质量的正确方法。

初学者代码效率低下是正常的，因此这里有许多简单的局部优化。egxor edx,edx比 5 byte 多 2 字节（并且更快）mov edx, 0。用于test edx,edx与零进行比较。它更小更好，并设置相同的标志cmp edx,0

此外，将变量保存在寄存器中。类似mov writeCounter, 0汇编的指令mov r/m32, imm32，使用 disp32 寻址模式。所以这是 8 个字节的地址和数据，加上 2 个字节的操作码 + ModR/M。每个绝对地址而不是寄存器操作数都需要 4 个字节。

如果您需要在循环内保存更多代码大小，您可以将内容复制到堆栈并使用[ebp+disp8]寻址模式访问它们（ModR/M 之外的 1 个字节而不是 4 个）。

要释放更多寄存器以在循环中使用，请在函数的开始/结束处推送/弹出它们。

问题中代码的清理版本

外部的位移loop是0xc9, 或 -55。所以我的版本不到你版本的一半。ResetWriteCounter（循环外有 12 个字节的代码。）

    push   ebx
    push   ebp       ; we're NOT using it as a frame pointer
    push   edi
    push   esi


;;; Important to keep in registers
; edi = inCompare = trial divisor
; ebx = outCompare = prime/composite candidate

;; nice to keep in registers
; ebp = WriteCounter  (but done as a down-counter with dec instead of inc/cmp)
; esi = outer loop counter

;; these variable names are kinda bad, IMO.
;; candidate / divisor would be better names.

; ecx = inner loop counter
; eax,edx = scratch for DIV


    mov    ebx, outCompare    ; first composite candidate

    mov    ebp, 10            ; down-counter to trigger newlines
    push   '  '          ; two spaces followed by 2 zero bytes (i.e. null-terminated string)
   ; Use mov edx, esp instead of mov edx, OFFSET spaces in the print block

   ;;; note that code before this isn't part of the size that LOOP has to jump over.

outerLoop:
        test    bl, 1
        jz      Print    ;; multiple of 2
        ;; inner loop now only has to check the odd divisors

        mov     edi, 3    ; first trial divisor = first odd prime (other than 1)
        mov     ecx, innerLoopCount     ; ideally set to sqrt(ebx), e.g. cvtsi2ss xmm0, ebx / sqrtss xmm0,xmm0 / cvtss2si xmm0, ecx might be correct :P
                   ; could instead use ecx = ebx directly if you don't care about speed, so try divisors all the way up to the candidate.

    ;; inner loop
    isComposite:
        mov     eax, ebx         ; edx:eax = dividend = candidate
        xor     edx, edx
        div     edi              ; trial division

        add     edi, 2           ; next trial divisor = next odd number
        test    edx, edx         ; equivalent to cmp edx,0 to check remainder
        jz      Print            ; you could maybe fold this into a LOOPNZ and check ZF after the loop
   ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print

        ;;cmp     edi, ebx        ;  divisor, candidate
        ;;jae     exitInnerLoop   ; what's the point of this?  isn't ecx set so that loop falls through after this many iterations?

        loop   isComposite     ;Retest with new divisor if there is a possibility

        jmp     NoPrint        ; else it was prime

;;; I put this outside the inner loop in the first place instead of needing to multiple jumps out of this to exitInnerLoop
    Print:   ;Print out a composite value if its been proven
        mov     eax, ebx
        call    WriteDec

        dec     ebp              ; new line when the down-counter reaches zero
        jz      ResetWriteCounter

        ; print spaces only if not printing a newline
        mov     edx, esp        ; OFFSET spaces;  Use stack memory we pushed earlier
        call    WriteString     ;Write spaces between the composites

    ;; tail of outer loop.  Prime numbers jump here without printing.
    exitInnerLoop:
    NoPrint:
        inc     ebx         ;Next number to check for being composite
                            ; TODO: increment by 2 and don't even check the even numbers, just print them after every odd prime/composite

        mov     ecx, esi    ; ecx = outer loop counter for the LOOP insn
        dec     esi         ; update outer counter instead of having  mov esi, ecx at the top of the loop
        loop    outerLoop

    add    esp, 4    ; clear the '  ' we pushed earlier.  pop esi  is smaller and about as efficiently

    pop    esi
    pop    edi
    pop    ebp
    pop    ebx
    ret

  ; after the end of the function is a good place for infrequently-use blocks that you jump to and then jump back.
 ResetWriteCounter:
        mov     ebp, 10         ; reset counter
        ; start a new line after writing 10 composites
        call    CrLf
        ; could have done this branchlessly by selecting a string of '  ',0 or 13,10,0 based on ebp being zero
        ; but doing the ebp update branchlessly too is worse than this highly predictable branch.  Maybe if the count was a power of 2, DEC/AND

        jmp  NoPrint          ; back where we came from, skipping the write spaces
       ; alternative: duplicate inc / mov / dec / loop block here

在分支结构方面，您可以做的更多。例如，我想消除jmp NoPrint掉出循环后的情况。这似乎很笨拙。将Print块放在其他地方就可以了，但是 Print 将不得不跳而不是跌倒。

TODO：ecx用作试验除数。不过，倒数可能比倒数慢得多，因为小的素数更常见。也需要避免1。

我打算只应用一些本地优化，但是当我完成对您的分支结构的理解时，我已经很开心地在代码大小上打高尔夫球了。高尔夫代码中的高尔夫 = 最少的字节获胜。

请参阅我的一些x86 机器代码高尔夫答案以了解代码大小的技巧（包括以牺牲速度为代价，有时甚至很多）：

8 字节循环中的 GCD
色度键在 13 个字节中混合两个图像（数组）。
Extreme Fibonacci：Fib(10^9) 105 字节 x86 32 位代码（使用 Linux 系统调用）中的adc前 1000 位数字，使用 10 次方小数肢体进行扩展精度，以便于截断以仅保留最重要的数字。这个很简洁，因为还有性能要求，所以内部循环针对速度进行了优化。（整个过程在我的 i7-6700k 上运行大约 70 秒。）
对数刻度适用于退出者：strlen 和 memset 用rep字符串替换 N 字符的字符串，用 base^N 填充字符；21 个字节。
做n和n^3有同组的十进制数字，40字节的机器码。（bts用于在寄存器中构建位图）。）

loops - 当必须使用循环命令时，无论如何要增加循环命令的范围

1 回答 1

问题中代码的清理版本

Related

Reference