我正在处理应用程序的一个非常低级别的部分,其中性能至关重要。
在调查生成的程序集时,我注意到以下说明:
lea eax,[edx*8+8]
我习惯于在使用内存引用时看到加法(例如 [edx+4]),但这是我第一次看到乘法。
- 这是否意味着 x86 处理器可以在 lea 指令中执行简单的乘法运算?
- 这种乘法是否会影响执行指令所需的周期数?
- 乘法是否仅限于 2 的幂(我假设是这种情况)?
提前致谢。
我正在处理应用程序的一个非常低级别的部分,其中性能至关重要。
在调查生成的程序集时,我注意到以下说明:
lea eax,[edx*8+8]
我习惯于在使用内存引用时看到加法(例如 [edx+4]),但这是我第一次看到乘法。
提前致谢。
要扩展我的评论并回答其余问题...
是的,它仅限于二的幂。(特别是 2、4 和 8)所以不需要乘数,因为它只是一个转变。它的重点是从索引变量和指针快速生成地址 - 其中数据类型是简单的 2、4 或 8 字节字。(尽管它也经常被滥用于其他用途。)
至于所需的周期数:根据Agner Fog 的表格,看起来该lea
指令在某些机器上是恒定的,而在其他机器上是可变的。
在 Sandy Bridge 上,如果它是“复杂的或相对撕裂的”,则会受到 2 个循环的惩罚。但它并没有说明“复杂”是什么意思……所以我们只能猜测,除非你做一个基准测试。
实际上,这不是特定于lea
指令的内容。
This type of addressing is called Scaled Addressing Mode
. The multiplication is achieved by a bit shift, which is trivial:
You could do 'scaled addressing' with a mov
too, for example (note that this is not the same operation, the only similarity is the fact that ebx*4
represents an address multiplication):
mov edx, [esi+4*ebx]
(source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html#memory)
For a more complete listing, see this Intel document. Table 2-3 shows that a scaling of 2, 4, or 8 is allowed. Nothing else.
Latency (in terms of number of cycles): I don't think this should be affected at all. A shift is a matter of connections, and selecting between three possible shifts is the matter of 1 multiplexer worth of delay.
To expand on your last question:
Is the multiplication limited to powers of 2 (I would assume this is the case)?
Note that you get the result of base + scale * index
, so while scale
has to be 1, 2, 4 or 8 (the size of x86 integer datatypes), you can get the equivalent of a multiplication by some different constants by using the same register as base
and index
, e.g.:
lea eax, [eax*4 + eax] ; multiply by 5
This is used by the compiler to do strength reduction, e.g: for a multiplication by 100, depending on compiler options (target CPU model, optimization options), you may get:
lea (%edx,%edx,4),%eax ; eax = orig_edx * 5
lea (%eax,%eax,4),%eax ; eax = eax * 5 = orig_edx * 25
shl $0x2,%eax ; eax = eax * 4 = orig_edx * 100