c++ - Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?

Question

I work with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the high 64 bits as fast as possible. Since AVX2 has no such instruction, is it reasonable to use the Karatsuba algorithm to gain speed?

score 7 · Accepted Answer

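No. On modern architectures, the crossover point at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it in somewhat further for scalar code, but 64x64 is still too small to benefit from Karatsuba.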
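
To make the size argument concrete, here is a minimal sketch (not part of the answer above) of the schoolbook method for the high 64 bits, built from four 32x32 -> 64-bit partial products. The function name `mulhi_schoolbook` is hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// High 64 bits of an unsigned 64x64 -> 128-bit product, computed from four
// 32x32 -> 64-bit partial products (schoolbook method).
// mulhi_schoolbook is an illustrative name, not a library function.
inline std::uint64_t mulhi_schoolbook(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    const std::uint64_t ll = a_lo * b_lo;  // contributes to bits 0..63
    const std::uint64_t lh = a_lo * b_hi;  // contributes to bits 32..95
    const std::uint64_t hl = a_hi * b_lo;  // contributes to bits 32..95
    const std::uint64_t hh = a_hi * b_hi;  // contributes to bits 64..127

    // Sum of everything that lands in bits 32..63; its upper half is the
    // carry into the high word. The sum is below 3 * 2^32, so it cannot
    // overflow 64 bits.
    const std::uint64_t mid =
        (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

Karatsuba would replace the two middle products with a single product of the sums (a_lo + a_hi) and (b_lo + b_hi). Those sums can be 33 bits wide, so that product no longer fits a 32x32 -> 64-bit multiply, and the extra additions and carry handling tend to cost more than the one multiply saved at this size.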

score 4 · Accepted Answer

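AVX2 is very unlikely to beat the mulx instruction, which does 64bx64b to 128b in one instruction. The one exception I know of is large multiplications using a floating-point FFT.

However, if you don't need exactly 64bx64b to 128b, you could consider 53bx53b to 106b using double-double arithmetic.

Multiplying four 53-bit numbers a by four 53-bit numbers b to get four 106-bit numbers takes only two instructions: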

__m256d p = _mm256_mul_pd(a,b);      // rounded products (high parts)
__m256d e = _mm256_fmsub_pd(a,b,p);  // a*b - p: rounding errors (low parts)

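This gives four 106-bit numbers in two instructions (each product is the unevaluated sum p + e), compared to one 128-bit number in one instruction using mulx.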
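
The same mul-then-fused-multiply-subtract idea can be sketched for a single lane in portable C++ using `std::fma`. This is an illustration under assumptions, not the answer's code: the `DoubleDouble` struct and `two_product` function are hypothetical names, and the result is exact only when the product neither overflows nor underflows.

```cpp
#include <cmath>

// A product held as an unevaluated sum hi + lo (double-double).
// DoubleDouble and two_product are illustrative names.
struct DoubleDouble {
    double hi;  // a*b rounded to the nearest double
    double lo;  // rounding error of that product
};

// Scalar counterpart of the mul + fmsub pair: std::fma(a, b, -p) computes
// a*b - p with a single rounding, which recovers the bits lost in p.
inline DoubleDouble two_product(double a, double b) {
    const double p = a * b;
    const double e = std::fma(a, b, -p);
    return {p, e};
}
```

For example, with a = b = 2^27 + 1 the exact product is 2^54 + 2^28 + 1, which needs 55 bits. The rounded product drops the trailing 1, and the error term recovers it.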

score 3 · Accepted Answer

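It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which supports 64x64=128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that if the operands are in YMM registers, they have to be moved to general-purpose registers first. That gives something like LOAD + MUL + STORE for a single 64x64=128.

If you can vectorize Karatsuba in AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, a single MUL will probably be faster. If you can eliminate the load and store to general-purpose registers, a single MUL will definitely be faster.

Both MUL and AVX2 instructions can take a memory operand with the same throughput, which may help eliminate one load for MUL.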
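
A minimal sketch of the scalar alternative, assuming the operands already sit in memory as arrays of 64-bit lanes and assuming GCC or Clang (`unsigned __int128` is a compiler extension, not standard C++). The function name `mulhi_lanes` is hypothetical. On x86-64 these compilers typically lower the widening multiply to a single MUL or MULX, though the exact code generated depends on the compiler and flags.

```cpp
#include <cstddef>
#include <cstdint>

// For each lane i, store the high 64 bits of a[i] * b[i] into hi[i].
// Assumes GCC or Clang: unsigned __int128 is a compiler extension.
// mulhi_lanes is an illustrative name, not a library function.
inline void mulhi_lanes(const std::uint64_t* a, const std::uint64_t* b,
                        std::uint64_t* hi, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Widen one operand so the multiplication is done in 128 bits.
        const unsigned __int128 wide =
            static_cast<unsigned __int128>(a[i]) * b[i];
        hi[i] = static_cast<std::uint64_t>(wide >> 64);
    }
}
```

Calling it with n = 4 covers the four 64-bit lanes of one YMM register's worth of data. Whether this beats a vectorized AVX2 version still has to be measured on the target machine.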
