3561

Consider the following code:

0.1 + 0.2 == 0.3  ->  false
0.1 + 0.2         ->  0.30000000000000004

Why do these inaccuracies happen?


31 Answers

2682

Binary floating point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

  • 0.1000000000000000055511151231257827021181583404541015625 in decimal, or
  • 0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

  • 0.1 in decimal, or
  • 0x1.99999999999999...p-4 in an analog of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2, but the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagrees with the constant in your code.

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side note: all positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a ten-thousandth does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, and halving it again, and again and again.

** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).

So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side note: working with floats in programming

In practice, this issue of precision means that you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss-of-precision issues). Beware of "epsilon"-style constants in your language of choice. These are not to be used as tolerance values.
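As an illustration, here is a minimal sketch of that advice in Python (the tolerance value is an application-specific assumption; math.isclose is the standard-library version of the same idea):

import math

x = 0.1 + 0.2
y = 0.3

my_tolerance = 1e-9                        # chosen for this application
print(x == y)                              # False
print(abs(x - y) < my_tolerance)           # True
print(math.isclose(x, y, rel_tol=1e-9))    # True (relative-tolerance variant)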

Answered 2009-02-25T21:40:32.900
673

A hardware designer's perspective

I believe I should add a hardware designer's perspective to this, since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in software, and ultimately, I hope this helps explain the reasons why floating point errors happen and why they seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error, since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation, which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error, since the errors add up over time.

2. Standards

Most processors follow the IEEE 754 standard, but some use denormalized, or different, standards. For example, there is a denormalized mode in IEEE 754 which allows representation of very small floating point numbers at the expense of precision. The following, however, covers the normalized mode of IEEE 754, which is the typical mode of operation.

In the IEEE 754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why, when there are repeated operations, the errors add up. For IEEE 754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.

3. Cause of rounding error in division

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE 754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or the number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE 754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2 = 4 bits (plus a few optional bits).

3.1 Division rounding error: approximation of the reciprocal

The reciprocals in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, while fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

4. Rounding errors in other operations: truncation

Another cause of the rounding errors in all operations are the different modes of truncation of the final answer that IEEE 754 allows. There's truncate, round-towards-zero, round-to-nearest (the default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

5. Repeated operations

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is the reason that in computations that require a bounded error, mathematicians use methods such as rounding to the nearest even digit in the last place of IEEE 754, because, over time, the errors are more likely to cancel each other out, and interval arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors and correct them. Because of its low relative error compared to other rounding modes, round to nearest even digit (in the last place) is the default rounding mode of IEEE 754.

Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using truncation, round-up, and round-down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in interval arithmetic.
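To see how a per-operation error can accumulate over repeated operations, here is a small illustrative sketch in Python (math.fsum is the standard-library routine that compensates for the low-order bits lost in each addition):

import math

# Add 0.1 ten times; the stored 0.1 is slightly above 1/10, and the
# intermediate roundings do not cancel, so the result is not 1.0.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)                  # 0.9999999999999999
print(total == 1.0)           # False
print(math.fsum([0.1] * 10))  # 1.0 - the correctly rounded sum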

6. Summary

In short, the fundamental cause of the errors in floating point operations is a combination of the truncation in hardware and the truncation of a reciprocal in the case of division. Since the IEEE 754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.

Answered 2013-04-18T11:52:32.400
575

It's broken in exactly the same way that the decimal (base-10) notation you learned in grade school is broken, just for base 2.

To understand, think about representing 1/3 as a decimal value. It's impossible to do exactly! In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the decimal point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods.

Answered 2009-02-25T21:43:07.887
356

Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.

Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.

That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.

Now, how would you piece together all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)


Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.

For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less than and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.

(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)

In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2x. Again, we favour the value that's slightly higher than 0.2.

Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.

In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.


P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.
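Python's decimal module is one such exact-tenths pizza cutter; here is a quick illustration (note that the strings matter: Decimal('0.1') is an exact tenth, while Decimal(0.1) would inherit the binary approximation):

from decimal import Decimal

print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
print(Decimal('0.1') + Decimal('0.2'))                    # 0.3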

(Originally posted on Quora.)

Answered 2014-11-20T02:39:59.620
226

Floating point rounding error. 0.1 cannot be represented as accurately in base 2 as in base 10, due to the missing prime factor of 5. Just as 1/3 takes an infinite number of digits to represent in decimal, but is "0.1" in base 3, 0.1 takes an infinite number of digits in base 2 where it does not in base 10. And computers do not have an infinite amount of memory.
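You can see the truncation directly by asking Python for the full decimal expansion of the double nearest to 0.1 (every binary64 value has a finite decimal expansion, so the 55 digits below are exact):

# The binary expansion of 1/10 is infinite, so the stored value had to be
# cut off somewhere; this is the value that actually got stored.
print(f"{0.1:.55f}")
# 0.1000000000000000055511151231257827021181583404541015625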

Answered 2009-02-25T21:41:23.273
140

My answer is quite long, so I've split it into three sections. Since the question is about floating point math, I've put the emphasis on what the machine actually does. I've also made it specific to double (64-bit) precision, but the argument applies equally to any floating point arithmetic.

Preamble

An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s * (1.m51 m50 ... m2 m1 m0)_2 * 2^(e-1023)

in 64 bits:

  • The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
  • The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
  • The remaining 52 bits are the significand (or mantissa). In the mantissa, an "implied" 1. is always omitted², since the most significant bit of any binary value is 1.

¹ - IEEE 754 allows the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².

² - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double-precision numbers is d_min ≤ |x| ≤ d_max, where d_min (the smallest representable nonzero number) is 2^(-1023 - 51) (≈ 4.94 × 10^-324) and d_max (the largest denormal number, whose mantissa consists entirely of 1s) is 2^(-1023 + 1) - 2^(-1023 - 51) (≈ 2.225 × 10^-308).


Converting a double-precision number to binary

Many online converters exist to convert a double-precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double-precision number (I separate the three parts with colons (:)):

public static string BinaryRepresentation(double value)
{
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    string bits = Convert.ToString(valueInLongType, 2);
    string leadingZeros = new string('0', 64 - bits.Length);
    string binaryRepresentation = leadingZeros + bits;

    string sign = binaryRepresentation[0].ToString();
    string exponent = binaryRepresentation.Substring(1, 11);
    string mantissa = binaryRepresentation.Substring(12);

    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}

Getting to the point: the original question

(Skip to the bottom for the TL;DR version)

Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.

Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:

0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010

Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error in the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits, any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.

Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10^-3 * 1.23 == 10^-5 * 123). This then enables us to express the binary representation as the exact value that it represents in the form a * 2^p, where 'a' is an integer.

Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:

0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125

To add the two numbers, the exponents need to be the same, i.e.:

0.1 => 2^-3 *  0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 *  1.1001100110011001100110011001100110011001100110011010
sum =  2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397  = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794  = 0.200000000000000011102230246251565404236316680908203125
sum =  2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875

Since the sum is not of the form 2^n * 1.{bbb}, we increase the exponent by one and shift the decimal (binary) point to get:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)
    = 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875

There are now 53 bits in the mantissa (the 53rd is in square brackets on the line above). The default rounding mode for IEEE 754 is "round to nearest" - i.e. if a number x falls between two values a and b, the value where the least significant bit is zero is chosen.

a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
  = 2^-2  * 1.0011001100110011001100110011001100110011001100110011

x = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)

b = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
  = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with a least significant bit of zero is b, so the sum is:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
    = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

whereas the binary representation of 0.3 is:

0.3 => 2^-2  * 1.0011001100110011001100110011001100110011001100110011
    =  2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875

which only differs from the binary representation of the sum of 0.1 and 0.2 by 2^-54.

The binary representations of 0.1 and 0.2 are the most accurate representations of those numbers allowed by IEEE 754. The addition of these representations, due to the default rounding mode, results in a value which differs only in the least significant bit.

TL;DR

Writing 0.1 + 0.2 in an IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):

0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3       => 0:01111111101:0011001100110011001100110011001100110011001100110[011]

Converted back to decimal, these values are:

0.1 + 0.2 => 0.300000000000000044408920985006...
0.3       => 0.299999999999999988897769753748...

The difference is exactly 2^-54, which is ~5.5511151231258 × 10^-17 - insignificant (for many applications) when compared to the original values.

Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.

Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
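You can confirm these exact values yourself; for example, a quick check in Python (float.hex() and Decimal(float) both expose the exact stored binary64 value):

from decimal import Decimal

print((0.1 + 0.2).hex())   # 0x1.3333333333334p-2 - mantissa ends in ...0100
print((0.3).hex())         # 0x1.3333333333333p-2 - mantissa ends in ...0011

print(Decimal(0.1 + 0.2))  # 0.3000000000000000444089209850062616169452667236328125
print(Decimal(0.3))        # 0.299999999999999988897769753748434595763683319091796875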

Answered 2015-02-23T17:15:35.347
136

In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.

For example:

var result = 1.0 + 2.0;     // result === 3.0 returns true

... instead of:

var result = 0.1 + 0.2;     // result === 0.3 returns false

The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating point is exact, so decimal representation errors can be avoided by scaling.

As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.


¹ Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105)
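A minimal sketch of the cents idea (shown in Python for brevity; the helper name is made up for illustration):

# Keep money in integer cents; integer arithmetic is exact, so no decimal
# representation error can creep in.
def add_money(cents_a, cents_b):
    return cents_a + cents_b

price = 2550   # $25.50 stored as 2550 cents
tax = 255      # $2.55 stored as 255 cents
total = add_money(price, tax)
print(f"${total // 100}.{total % 100:02d}")  # $28.05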

Answered 2010-04-09T12:25:09.333
51

Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.

Answered 2009-02-25T21:42:49.353
43

In short it's because:

Floating point numbers cannot represent all decimals precisely in binary

So just like 10/3, which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.

So what? How to deal with it? Is there any workaround?

To offer the best solution I can, I'll describe the following method I discovered:

parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3

Let me explain why it's the best solution. As others mentioned in the answers above, it's a good idea to use the ready-made JavaScript toFixed() function to solve the problem. But most likely you'll encounter some problems.

Imagine you are going to add up two floats like 0.2 and 0.7: 0.2 + 0.7 = 0.8999999999999999.

Your expected result was 0.9, which means you need a result with 1-digit precision in this case. So you should have used (0.2 + 0.7).toFixed(1), but you can't just give a fixed parameter to toFixed(), since it depends on the given number. For instance:

0.22 + 0.7 = 0.9199999999999999

In this example you need 2-digit precision, so it should be toFixed(2). So what should the parameter be to fit every given float number?

You might say, let it be 10 in every situation. Then:

(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"

Damn! What are you going to do with those unwanted zeros after 9? It's time to convert it to a float to make it look the way you want:

parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9

Now that you've found the solution, it's better to offer it as a function like this:

function floatify(number) {
    return parseFloat(number.toFixed(10));
}
    

Try it yourself:

function floatify(number) {
    return parseFloat(number.toFixed(10));
}
 
function addUp(){
  var number1 = +$("#number1").val();
  var number2 = +$("#number2").val();
  var unexpectedResult = number1 + number2;
  var expectedResult = floatify(number1 + number2);
  $("#unexpectedResult").text(unexpectedResult);
  $("#expectedResult").text(expectedResult);
}
addUp();
input {
  width: 50px;
}
#expectedResult {
  color: green;
}
#unexpectedResult {
  color: red;
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>

You can use it this way:

var x = 0.2 + 0.7;
floatify(x);  => Result: 0.9

As W3SCHOOLS suggests, there is another solution too; you can multiply and divide to solve the problem above:

var x = (0.2 * 10 + 0.1 * 10) / 10;       // x will be 0.3

Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all, although it seems like the same thing! I prefer the first solution, since I can apply it as a function which converts the input float to an accurate output float.

Answered 2018-08-07T09:34:15.767
35

My workaround:

function add(a, b, precision) {
    var x = Math.pow(10, precision || 2);
    return (Math.round(a * x) + Math.round(b * x)) / x;
}

precision refers to how many digits you want to preserve after the decimal point during the addition.

Answered 2011-12-26T06:51:53.710
33

No, not broken, but most decimal fractions must be approximated

Summary

Floating point arithmetic is exact, unfortunately it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.

Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base 2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict, so we throw up our hands and say "FP is inexact", but that's not really true.

We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.

How did this happen?

When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form

           a / (2^n x 5^m)

In binary, we only get the 2^n term, that is:

           a / 2^n

So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction can also be written as a base-10 fraction. However, hardly anything we write as a base-10 fraction is representable in binary. In the range from 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50 and 0.75, because they are 1/4, 1/2 and 3/4, all numbers with a denominator using only the 2^n term.

In base 10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.

So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.

Dealing with it

Developers are usually instructed to do < epsilon comparisons; better advice might be to round to integral values (in the C library: round() and roundf(), i.e. staying in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
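For instance, a small sketch of that advice in Python (the two-decimal "cents" scale is an arbitrary choice for illustration):

# Round both sides at a fixed decimal scale before comparing, instead of
# testing raw floating point equality.
def same_to_cents(x, y):
    return round(x * 100) == round(y * 100)

print(0.1 + 0.2 == 0.3)               # False
print(same_to_cents(0.1 + 0.2, 0.3))  # True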

Also, on real number-crunching problems (the problems that FP was invented for on those early, frightfully expensive computers), the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.

The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick with integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.

I love the pizza answer by Chris, because it describes the actual problem, not just the usual hand-waving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy of the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores were used for 1-bit storage, but that's another story.)

Conclusion

If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.

Answered 2016-02-02T23:49:41.243
32

Lots of good answers have been posted already, but I'd like to append one more.

Not all numbers can be represented via floats/doubles. For example, the number 0.2 will be represented as 0.200000003 in single precision in the IEEE 754 floating point standard.

The model for storing real numbers under the hood represents float numbers in the form

value = (-1)^sign * 1.fraction * 2^(exponent - bias)

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, for a computer with an FPU that uses the "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".

So it is a bit hard to represent such numbers exactly. Even if you specify this variable explicitly, without any intermediate calculation.

Answered 2014-10-05T18:39:35.863
31

Some statistics related to this famous double-precision question.

When adding all values (a + b) using a step of 0.1 (from 0.1 to 100), we have about a 15% chance of a precision error. Note that the error could result in slightly bigger or slightly smaller values. Here are some examples:

0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)

When subtracting all values (a - b, where a > b) using a step of 0.1 (from 100 to 0.1), we have about a 34% chance of a precision error. Here are some examples:

0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)

*15% and 34% are indeed huge, so always use BigDecimal when precision is of great importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
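A rough sketch of how such an experiment can be reproduced in Python (the error test below - comparing the printed sum against the correctly rounded one-decimal result - is an assumption about the original methodology, so the exact percentage may differ):

def shows_error(a, b):
    # The sum "looks wrong" if it prints differently from the correctly
    # rounded one-decimal result.
    return repr(a + b) != repr(round(a + b, 1))

values = [round(i * 0.1, 1) for i in range(1, 1001)]  # 0.1 .. 100.0
errors = sum(shows_error(a, b) for a in values for b in values)
print(f"{100 * errors / len(values) ** 2:.1f}% of additions show an error")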

Answered 2015-01-03T12:12:14.820
19

Given that nobody has mentioned this...

Some high-level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:

  • Python's decimal module and Java's BigDecimal class, which represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone; however, they solve the most common problems with binary floating point arithmetic.

    Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:

    >>> from decimal import Decimal
    >>> 0.1 + 0.2 == 0.3
    False
    >>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
    True
    

    Python's decimal module is based on IEEE standard 854-1987.

  • Python's fractions module and Apache Commons' BigFraction. Both represent rational numbers as (numerator, denominator) pairs, and they may give more accurate results than decimal floating point arithmetic.

Neither of these solutions is perfect (especially if we look at performance, or if we require very high precision), but they still solve a great number of problems with binary floating point arithmetic.
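As a quick illustration of the fractions approach in Python (exact as long as values are built from strings or integer ratios rather than from floats):

from fractions import Fraction

# Exact rational arithmetic: one tenth plus one fifth really is three tenths.
print(Fraction(1, 10) + Fraction(1, 5) == Fraction(3, 10))  # True
print(Fraction('0.1') + Fraction('0.2'))                    # 3/10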

Answered 2015-08-21T14:53:07.447
18

Did you try the duct tape solution?

Try to determine when errors occur, and fix them with short if statements. It's not pretty, but for some problems it is the only solution, and this is one of them.

 if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
                    else { return n * 0.1 + 0.000000000000001 ;}    

I had the same problem in a scientific simulation project in C#, and I can tell you that if you ignore the butterfly effect, it's going to turn into a big fat dragon and bite you in the a**.

Answered 2012-08-01T07:02:44.423
18

Those weird numbers appear because computers use the binary (base 2) number system for calculation purposes, while we use decimal (base 10).

The majority of fractional numbers cannot be represented precisely either in binary or in decimal or both. As a result, a rounded (but precise) number is what gets used.

Answered 2013-10-14T16:45:17.880
17

Many of this question's many duplicates ask about the effects of floating point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at exact results of the computations of interest rather than by just reading about it. Some languages provide ways of doing that - such as converting a float or double to BigDecimal in Java.

Since this is a language-agnostic question, it needs language-agnostic tools, such as a Decimal to Floating-Point Converter.

Applying it to the numbers in the question, treated as doubles:

0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,

0.2 converts to 0.200000000000000011102230246251565404236316680908203125,

0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and

0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.

Adding the first two numbers manually, or in a decimal calculator such as Full Precision Calculator, shows the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.

If it were rounded down to the equivalent of 0.3, the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives rounding error 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.

Returning to the floating point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and is therefore the correct result.

Answered 2015-12-21T11:15:49.590
17

Can I just add: people always assume this to be a computer problem, but if you count with your hands (base 10), you can't get (1/3 + 1/3 = 2/3) = true unless you have infinity to add 0.333... to 0.333..., so just as with the (1/10 + 2/10) !== 3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667, which would also be technically inaccurate.

Count in ternary, and thirds are not a problem though - maybe some race with 15 fingers on each hand would ask why your decimal math was broken...

Answered 2016-03-18T00:38:44.017
10

The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)

This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for due to its specific manner of deviation from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.

If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.

But if you're okay with the idea that sometimes floating-point math is fuzzy in value and logic and errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.

Answered 2015-10-05T15:55:03.597
10

Just for fun, I played with the representation of floats, following the definitions from Standard C99, and I wrote the code below.

The code prints the binary representation of floats in 3 separate groups

SIGN EXPONENT FRACTION

and after that it prints a sum which, when evaluated with enough precision, shows the value that really exists in hardware.

So when you write float x = 999..., the compiler will transform that number into a bit representation printed by the function xx such that the sum printed by the function yy is equal to the given number.

In reality, this sum is only an approximation. For the number 999,999,999, the compiler will insert into the bit representation of the float the number 1,000,000,000.

After the code I attach a console session, in which I compute the sum of terms for both constants (minus PI and 999999999) that really exist in hardware, inserted there by the compiler.

#include <stdio.h>
#include <string.h>
#include <limits.h>

/* Print the 32 bits of a float, labelling the sign, exponent and
 * fraction fields (bit 31, bits 30..23 and bits 22..0 respectively). */
void
xx(float *x)
{
    unsigned int v;
    memcpy(&v, x, sizeof v);    /* read the bits without aliasing the float */
    unsigned char i = sizeof(*x)*CHAR_BIT-1;
    do {
        switch (i) {
        case 31:
             printf("sign:");
             break;
        case 30:
             printf("exponent:");
             break;
        case 22:
             printf("fraction:");
             break;
        }
        char b = (v & (1u << i)) != 0;
        printf("%d ", b);
    } while (i--);
    printf("\n");
}

/* Print the float as the sum of powers of two the hardware stores. */
void
yy(float a)
{
    unsigned int v;
    memcpy(&v, &a, sizeof v);
    int sign = !(v & (1u << 31));
    int fraction = ((1 << 23) - 1) & v;
    int exponent = (255 & (v >> 23)) - 127;

    printf(sign ? "positive" " ( 1+" : "negative" " ( 1+");
    unsigned int i = 1u << 22;
    unsigned int j = 1;
    do {
        char b = (fraction & i) != 0;
        /* print the term, then '+' if lower bits remain, else ')' */
        b && (printf("1/(%d) %c", 1 << j, (fraction & (i - 1)) ? '+' : ')'), 0);
    } while (j++, i >>= 1);

    printf("*2^%d", exponent);
    printf("\n");
}

int
main(void)
{
    float x = -3.14f;
    float y = 999999999;
    printf("%zu\n", sizeof(x));
    xx(&x);
    xx(&y);
    yy(x);
    yy(y);
    return 0;
}

Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.

-- .../terra1/stub
@ qemacs f.c
-- .../terra1/stub
@ gcc f.c
-- .../terra1/stub
@ ./a.out
4
sign:1 exponent:1 0 0 0 0 0 0 0 fraction:1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign:0 exponent:1 0 0 1 1 1 0 0 fraction:1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
@ bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872

That's it. The value of 999999999 is in fact

999999999.999999446351872

You can also check with bc that -3.14 is also perturbed. Do not forget to set a scale factor in bc.

The displayed sum is what is inside the hardware. The value you obtain by computing it depends on the scale you set. I set the scale factor to 15. Mathematically, with infinite precision, it seems it is 1,000,000,000.

Answered 2016-12-29T10:29:32.123
7

Another way to look at this: 64 bits are used to represent numbers. As a consequence, there is no way more than 2^64 = 18,446,744,073,709,551,616 different numbers can be precisely represented.

However, math says there are already infinitely many decimals between 0 and 1. IEEE 754 defines an encoding to use these 64 bits efficiently for a much larger number space plus NaN and +/- Infinity, so there are gaps between accurately represented numbers that are filled with numbers only approximated.

Unfortunately, 0.3 sits in a gap.
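You can walk right up to the edges of that gap in Python (math.nextafter needs Python 3.9 or later):

import math

# The literal 0.3 already denotes the nearest representable double, which
# is slightly below the true decimal 0.3. Its neighbour one ulp above is
# the familiar 0.30000000000000004 - exactly what 0.1 + 0.2 produces.
print(math.nextafter(0.3, 1.0))               # 0.30000000000000004
print(0.1 + 0.2 == math.nextafter(0.3, 1.0))  # True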

Answered 2017-12-19T22:37:37.480
7

Since Python 3.5, you can use the math.isclose() function for testing approximate equality:

>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
Answered 2018-08-08T08:47:36.087
7

Floating point numbers are represented, at the hardware level, as fractions of binary numbers (base 2). For example, the decimal fraction:

0.125

has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:

0.001

has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.

Unfortunately, most decimal fractions cannot have exact representation in binary fractions. Therefore, in general, the floating point numbers you give are only approximated to binary fractions to be stored in the machine.

The problem is easier to approach in base 10. Take, for example, the fraction 1/3. You can approximate it as a decimal fraction:

0.3

or better,

0.33

or better,

0.333

etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.

Likewise, no matter how many base-2 digits you are willing to use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following periodic number:

0.0001100110011001100110011001100110011001100110011 ...

Stop at any finite amount of bits, and you'll get an approximation.

For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction.

0.00011001100110011001100110011001100110011001100110011010

which is close, but not exactly equal, to 1/10.

It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:

>>> 0.1
0.1000000000000000055511151231257827021181583404541015625

This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:

>>> 0.1
0.1

It is important to understand that in reality this is an illusion: the stored value is not exactly 1/10; it is simply that the display rounds the stored value. This becomes evident as soon as you perform arithmetic operations with these values:

>>> 0.1 + 0.2
0.30000000000000004

This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages that use hardware support for calculating floating point numbers (although some languages do not make the difference visible by default, or not in all display modes).

Another surprise is inherent in this one. For example, if you try to round the value 2.675 to two decimal places, you will get

>>> round(2.675, 2)
2.67

The documentation for the round() primitive indicates that it rounds to the nearest value, with ties rounded away from zero. Since the decimal fraction 2.675 is exactly halfway between 2.67 and 2.68, you might expect the result to be (a binary approximation of) 2.68. It is not, however, because when the decimal fraction 2.675 is converted to a float, it is stored as an approximation whose exact value is:

2.67499999999999982236431605997495353221893310546875

Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.

If you are in a situation where it matters how decimal halfway cases are rounded, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float:

>>> from decimal import Decimal
>>> Decimal(2.675)
Decimal('2.67499999999999982236431605997495353221893310546875')

Another consequence of the fact that 0.1 is not stored as exactly 1/10 is that the sum of ten values of 0.1 does not give 1.0 either:

>>> sum = 0.0
>>> for i in range(10):
...     sum += 0.1
...
>>> sum
0.9999999999999999

The arithmetic of binary floating point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.

It is true that there is no simple answer; however, do not be overly suspicious of floating-point numbers! Errors, in Python, in floating-point number operations are due to the underlying hardware, and on most machines are no more than 1 in 2**53 per operation. This is more than necessary for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating point numbers may suffer from a new error.

Although pathological cases exist, for most common use cases you will get the expected result at the end by simply rounding to the number of decimal places you want on the display. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format() method.

This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating point numbers. The term representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:

>>> 0.1 + 0.2
0.30000000000000004

Why? 1/10 and 2/10 are not representable exactly in binary fractions. However, all machines today (July 2010) follow the IEEE 754 standard for the arithmetic of floating point numbers, and most platforms use an "IEEE 754 double precision" to represent Python floats. Double precision IEEE 754 uses 53 bits of precision, so on reading, the computer tries to convert 0.1 to the nearest fraction of the form J / 2**N, with J an integer of exactly 53 bits. Rewriting:

1 / 10 ~= J / (2**N)

as:

J ~= 2**N / 10

remembering that J is exactly 53 bits (so >= 2**52 but < 2**53), the best possible value for N is 56:

>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56 // 10
7205759403792793

So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:

>>> q, r = divmod(2 ** 56, 10)
>>> r
6

Since the remainder is more than half of 10, the best approximation is obtained by rounding up:

>>> q + 1
7205759403792794

Therefore the best possible approximation for 1/10 in "IEEE 754 double precision" is this, divided by 2**56:

7205759403792794 / 72057594037927936

Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!

So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation using the double precision floating point numbers of IEEE 754:

>>> 0.1 * 2 ** 56
7205759403792794.0

If we multiply this fraction by 10**30, we can observe the values of its 30 most significant decimal digits:

>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125

meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded these values to 17 significant decimal digits, displaying "0.10000000000000001". In current versions of Python, the displayed value is the value whose fraction is as short as possible while giving exactly the same representation when converted back to binary, simply displaying "0.1".

Answered 2020-08-03T15:03:24.803
7

The trap with floating point numbers is that they look like decimal but they work in binary.

The only prime factor of 2 is 2, while 10 has prime factors of 2 and 5. The result of this is that every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction but only a subset of numbers that can be written as decimal fractions can be written as binary fractions.

A floating point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits then the results will be rounded.

When you type a literal in your code or call the function to parse a floating point number from a string, it expects a decimal number, and it stores a binary approximation of that decimal number in the variable.

When you print a floating point number or call the function to convert one to a string it prints a decimal approximation of the floating point number. It is possible to convert a binary number to decimal exactly, but no language I'm aware of does that by default when converting to a string*. Some languages use a fixed number of significant digits, others use the shortest string that will "round trip" back to the same floating point value.

* Python does convert exactly when converting a floating point number to a "decimal.Decimal". This is the easiest way I know of to obtain the exact decimal equivalent of a floating point number.
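For example, a quick illustration in Python of both behaviours described above:

from decimal import Decimal

# The exact decimal equivalent of the stored double:
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625

# The default string conversion uses the shortest form that round-trips:
print(repr(0.1))                # '0.1'
print(float(repr(0.1)) == 0.1)  # True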

Answered 2021-09-16T00:28:18.790
5

Imagine working in base ten with, say, 8 digits of accuracy. You check whether

1/3 + 2/3 == 1

and learn that this returns false. Why? Well, as real numbers we have

1/3 = 0.333.... and 2/3 = 0.666....

Truncating at eight decimal places, we get

0.33333333 + 0.66666666 = 0.99999999

which is, of course, different from 1.00000000 by exactly 0.00000001.


The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have

1/10 = 0.0001100110011001100... (base 2)

and

1/5 = 0.0011001100110011001... (base 2)

If we truncated these to, say, seven bits, then we'd get

0.0001100 + 0.0011001 = 0.0100101

while on the other hand,

3/10 = 0.01001100110011... (base 2)

which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.


The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.

The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.

Answered 2018-12-20T18:27:35.127
5

Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:

#include <stdio.h>

int main() {
    printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
    printf("0.1 is %.23f\n", 0.1);
    printf("0.2 is %.23f\n", 0.2);
    printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
    printf("0.3 is %.23f\n", 0.3);
    printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
    return 0;
}

Output:

0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17

For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C Standard does not specify such types by default; they are described as an extension in a Technical Report.

The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).

Answered 2019-04-22T01:02:32.437
4

Since this thread branched off a bit into a general discussion of current floating point implementations, I'd add that there are projects working on fixing their issues.

Take a look at https://posithub.org/ for example, which showcases a number type called posit (and its predecessor unum) that promises to offer better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problems in the question. Quite an interesting project; the person behind it is the mathematician Dr. John Gustafson. The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).

Answered 2017-12-22T16:39:07.813
4

It's actually pretty simple. When you have a base-10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7.

In binary (or base 2), the only prime factor is 2, so you can only cleanly express fractions which only contain 2 as a prime factor. In binary, 1/2, 1/4, 1/8 would all be expressed cleanly as decimals, while 1/5 or 1/10 would be repeating decimals. So 0.1 and 0.2 (1/10 and 1/5), while clean decimals in a base-10 system, are repeating decimals in the base-2 system the computer is operating in. When you do math on these repeating decimals, you end up with leftovers which carry over when you convert the computer's base-2 (binary) number into a more human-readable base-10 number.

From https://0.30000000000000004.com/

Answered 2019-05-07T19:45:11.373
4

Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary base-2 arithmetic, you are dealing with halves, fourths, eighths, etc.

In the hardware, floating points are stored as integer mantissas and exponents. Mantissa represents the significant digits. Exponent is like scientific notation but it uses a base of 2 instead of 10. For example 64.0 would be represented with a mantissa of 1 and exponent of 6. 0.125 would be represented with a mantissa of 1 and an exponent of -3.

The fractional part of a floating point value has to be built by adding up negative powers of 2:

0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d

and so on.

It is common to use an error delta instead of equality operators when dealing with floating point arithmetic. Instead of

if(a==b) ...

you would use

delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...
Answered 2020-08-20T15:38:53.507
2

I just saw this interesting issue around floating points:

Consider the following results:

>>> (2**53+1) - int(float(2**53+1))
1

We can clearly see a breakpoint at 2**53 + 1 - everything works fine up to 2**53:

>>> (2**53) - int(float(2**53))
0


This happens because of double-precision binary: the IEEE 754 double-precision binary floating-point format (binary64).

From the Wikipedia page for Double-precision floating-point format:

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision floating-point format, it lacks precision on integer numbers when compared with an integer format of the same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significant precision: 53 bits (52 explicitly stored)


The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction is

value = (-1)^sign * (1.b51 b50 ... b0)_2 * 2^(e-1023)

which can also be written as

value = (-1)^sign * (1 + sum[i=1..52] b(52-i) * 2^-i) * 2^(e-1023)

Thanks to @a_guest for pointing that out to me.

Answered 2019-10-05T21:46:11.000