performance - 为什么浮点除法慢？

Question

算法中进行浮点除法的步骤是什么？

为什么结果比乘法慢？

它的完成方式与我们手动除法的方式相同吗？通过重复除数，减去结果获得余数，再次对齐数字并继续直到余数小于特定值？

另外，如果我们不这样做，为什么我们会提高性能？

a = b / c

我们的确是

d = 1 / c
a = b * d

?

编辑：基本上我问是因为有人要求我根据权重分配在竞争者之间分配一个值。我用整数完成了所有这些操作，后来被要求转换为浮点数，这导致性能下降。我只是想知道 C 或 C++ 如何执行这些会导致缓慢的操作。

score 23 · Accepted Answer

FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.

This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:

Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).

score 18 · Accepted Answer

From a hardware point of view division is a iterative algorithm, and the time it takes is proportional to the number of bits. The fastest division that is currently around uses the radix4 algorithm which generates 4 bit of result per iteration. For a 32 bit divide you need 8 steps at least.

Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.

You need log2 steps to combine the parallel computed results, so a 32 bit multiply need 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (it's just additions). That means in practice multiplies are even faster.

score 6 · Accepted Answer

As described in the Wikipedia article Division algorithm, there are two main aproaches to division in computers:

Slow Division

Uses the following recurrence and finds one digit per iteration: partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator

Fast Division

Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.

Newton-Raphson division (very briefly):

Calculate estimate of the reciprocal.
Compute more accurate estimates of the reciprocal.
Compute quotient by multiplying the dividend by the reciprocal.

score 1 · Accepted Answer

你不会通过这样做来获得性能

d = 1 / c
a = b * d

你可能的意思是：

d = 1 / c
a1 = b1 * d
a2 = b2 * d

这样，除法只进行一次。

除法本身比乘法慢，但是，我不知道细节。基本原因是，类似于 sin 或 sqrt 等函数，它只是在数学上更复杂。IIRC，乘法在平均 CPU 上大约需要 10 个周期，而除法大约需要 50 或更多。

John Mulder 很好地解释了它是如何完成的。

score 0 · Accepted Answer

Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integral ALUs, the division circuit is a far busier place than a multiplication circuit. I would suspect this is only more painful in the world of floating point, as now the data isn't just least to most significant digit ordered, but is instead ordered by the IEEE 754 standard.

As for the round off, it's really about wherever the signals traveling between the gates get soldered to ground; where that happens, you lose digits. Not rounding, so much as truncation.

Or were you asking about simulating floating point arithmetic using just integers?

score 0 · Accepted Answer

Float division is not much slower than integer division, but the compiler may be unable to do the same optimizations.

For example the compiler can replace integer division between 3 with a multiplication and a binary shift. Also it can replace float division between 2.0 with a multipliation by 0.5 but it cannot replace division between 3.0 with a multiplication by 1/3.0 as 1/3.0 cannot be represented exactly using binary numbers, therefore rounding errors may change the result of the division.
As the compiler doesn't know how sensitive is your application to rounding errors (say you were doing a weather simulation, see Butterfly effect )it cannot do the optimization.

performance - 为什么浮点除法慢？

6 回答 6

Slow Division

Fast Division

Related

Reference