optimization - Understanding the overhead of floating-point instructions

Question

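To begin with, this may be partly a discussion and partly a problem to solve. No offense intended to anyone here.

I have written a 64-bit random number generator, based on the MT Prime algorithm, in 64-bit assembly. The generator function needs to be called about 8 billion times to fill an array of size 2048x2048x2048 with random numbers between 1..small_value (typically 32).

Now I have two possibilities for the next step:

(a) Keep generating numbers, compare them against the limits [1..32], and discard those that fall outside the range. This logic takes 181,817 ms to run, measured by calling the clock() function.

(b) Take the 64-bit random number output in RAX, use the FPU to scale it to [0..1], and then scale it to the required range [1..32]. The code sequence for this is as follows: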

 mov word ptr initialize_random_number_scaling,dx
 fnclex             ; clears status flag
 call generate_fp_random_number ; returns a random number in ST(0) between [0..1]
 fimul word ptr initialize_random_number_scaling ; Mults ST(0) & stores back in ST(0)
 mov word ptr initialize_random_number_base,ax ; Saves base to a memory
 fiadd word ptr initialize_random_number_base  ; adds the base to the scaled fp number
 frndint                            ; rounds off the ST(0)
 fist word ptr initialize_random_number_result ; and stores this number to result.
 ffree st(0)               ; releases ST(0)
 fincstp                       ; Logically pops the FPU
 mov ax, word ptr initialize_random_number_result       ; and saves it to AX

The instructions in generate_fp_random_number are as follows:

 shl rax,1  ; RAX gets the original 64 bit random number using MT prime algorithm
 shr ax,1   ; Clear top bit
 mov qword ptr random_number_generator_act_number,rax ; Save the number in memory as we cannot move to ST(0) a number from register
 fild   qword ptr random_number_generator_max_number    ; Load 0x7FFFFFFFFFFFFFFFH
 fild   qword ptr random_number_generator_act_number    ; Load our number
 fdiv   st(0),st(1) ; We return the value through ST(0) itself, divide our random number with max possible number
 fabs
 ffree st(1)    ; release the st(1)
 fld1           ; push to top of stack a 1.0
 fcomip st(0), st(1)    ; compares our number in ST(1) with ST(0) and sets CF.
 jc generate_fp_random_get_next_no ; if ST(0) (=1.0) < ST(1) (our no), we need a new no
 fldz               ; push to top of stack a 0.0
 fcomip st(0),st(1) ; if ST(0) (=0.0) >ST(1) (our no) clears CF
 jnc generate_fp_random_get_next_no ; so if the number is above zero the CF will be set
 fclex

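The problem is that just by adding these instructions, the run time jumped to a whopping 5,633,963 ms! I have also written the above using the xmm registers as an alternative, and the difference is absolutely marginal (5,633,703 ms).

Could anyone tell me how much these additional instructions should add to the total run time? Is the FPU really this slow? Or am I missing a trick? As always, all ideas are welcome, and thank you for your time and effort.

Environment: Windows 7 64-bit on an Intel 2700K CPU overclocked to 4.4 GHz, 16 GB RAM, debugged in the VS 2012 Express environment.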

score 0 · Accepted Answer

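"mov word ptr initialize_random_number_base,ax ; Saves base to a memory"

If you want maximum speed, you have to understand how to keep write instructions apart and write data to a different section of memory than the code.

Rewriting data in the same area of the cache as the executing code creates a "self-modifying code" situation.

Your compiler may or may not do this for you. You need to know, because unoptimized assembly code can run 10 to 50 times slower.

"All modern processors cache code and data memory for efficiency. The performance of assembly language code can be seriously affected if data is written to the same block of memory as the code being executed, because it may cause the CPU to repeatedly reload ..."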

http://www.bbcbasic.co.uk/bbcwin/manual/bbcwina.html#cache

score 0 · Accepted Answer

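There is a lot of stuff in your code that I can see no reason for. If there is a reason, feel free to correct me, but otherwise here are my alternatives:

For generate_fp_random_number: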

shl rax, 1
shr rax, 1
mov qword ptr act_number, rax
fild qword ptr max_number
fild qword ptr act_number
fdivrp   ; divide actual by max and pop
; and that's it. It's already within bounds.
; It can't be outside [0, 1] by construction.
; It can't be < 0 because we just divided two non-negative numbers,
; and it can't be > 1 because we divided by the max it could be

For the other part:

mov word ptr scaling, dx
mov word ptr base, ax
call generate_fp_random_number
fimul word ptr scaling
fiadd word ptr base
fistp word ptr result  ; just save that thing
mov ax, word ptr result
; the default rounding mode is round to nearest,
; so the slow frndint is unnecessary

Also note the complete absence of ffree's and the like. It all worked out by making the right instructions pop. It usually does.
