c++ - 重新审视浮点比较

Question

这个话题在 StackOverflow 上出现过很多次，但我相信这是一个新的尝试。是的，我已经阅读了Bruce Dawson 的文章和What Every Computer Scientist Should Know About Floating-Point Arithmetic以及这个不错的答案。

据我了解，在典型系统上比较浮点数是否相等时有四个基本问题：

浮点计算不准确
是否a-b“小”取决于规模a和b
是否a-b“小”取决于aand的类型b（例如float、double、long double）
浮点通常具有 +-infinity、NaN 和非规范化表示，其中任何一个都可能干扰朴素的公式

这个答案——又名。“谷歌方法”——似乎很流行。它确实处理了所有棘手的情况。它确实非常精确地扩展了比较，检查两个值是否在彼此固定数量的ULP内。因此，例如，一个非常大的数字将“几乎等于”与无穷大进行比较。

然而：

在我看来，这非常混乱。
它不是特别便携，严重依赖内部表示，使用联合从浮点数中读取位等。
它只处理单精度和双精度 IEEE 754（特别是没有 x86 long double）

我想要类似的东西，但使用标准 C++ 并处理长双精度数。“标准”是指 C++03（如果可能）和 C++11（如果需要）。

这是我的尝试。

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Local version of frexp() that handles infinities specially.
template<typename T>
T my_frexp(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    // Treat +-infinity as +-(2^max_exponent).
    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent + 1;
        return std::copysign(0.5, num);
    }
    else return std::frexp(num, exp);
}
}

template<typename T>
bool almostEqual(const T a, const T b, const unsigned ulps=4)
{
    // Handle NaN.
    if (std::isnan(a) || std::isnan(b))
        return false;

    typedef std::numeric_limits<T> limits;

    // Handle very small and exactly equal values.
    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    // frexp() does the wrong thing for zero.  But if we get this far
    // and either number is zero, then the other is too big, so just
    // handle that now.
    if (a == 0 || b == 0)
        return false;

    // Break the numbers into significand and exponent, sorting them by
    // exponent.
    int min_exp, max_exp;
    T min_frac = my_frexp(a, &min_exp);
    T max_frac = my_frexp(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    // Convert the smaller to the scale of the larger by adjusting its
    // significand.
    const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);

    // Since the significands are now in the same scale, and the larger
    // is in the range [0.5, 1), 1 ulp is just epsilon/2.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
}

我声称此代码 (a) 处理所有相关情况，(b) 与 Google 实现的 IEEE-754 单精度和双精度实现相同，并且 (c) 是完全标准的 C++。

这些主张中的一项或多项几乎肯定是错误的。我会接受任何证明这一点的答案，最好是修复。一个好的答案应该包括以下一项或多项：

特定输入的差异超过ulps最后位置的单位，但此函数返回 true（差异越大越好）
特定输入的差异最多为ulps最后位置的 Units，但此函数返回 false（差异越小越好）
我错过的任何案例
此代码依赖于未定义行为或中断的任何方式取决于实现定义的行为。（如果可能，请引用相关规范。）
修复您发现的任何问题
任何在不破坏代码的情况下简化代码的方法

我打算在这个问题上放一个不平凡的赏金。

score 25 · Accepted Answer

“几乎等于”不是一个好函数

4 不是一个合适的值：您指出的答案是“因此，4 应该足以满足日常使用”，但没有任何依据。事实上，在普通情况下，通过不同方式以浮点计算的数字可能会因许多 ULP 不同而不同，即使如果通过精确数学计算它们会相等。因此，容差不应有默认值；应该要求每个用户提供他们自己的，希望基于对其代码的彻底分析。

作为为什么默认 4 ULP 不好的示例，请考虑1./49*49-1. 数学上精确的结果为 0，但计算结果（64 位 IEEE 754 二进制）为 2 ^-53，误差超过精确结果的 10 ³⁰⁷ ULP 和计算结果的几乎 10 ¹⁶ ULP。

有时，没有合适的值：在某些情况下，容差不能与被比较的值相关，既不是数学上精确的相对容差，也不是量化的 ULP 容差。例如，FFT 中几乎每个输出值都受到几乎每个输入值的影响，并且任何一个元素的误差都与其他元素的大小有关。必须为“几乎等于”例程提供额外的上下文，其中包含有关潜在错误的信息。

“几乎等于”的数学性质很差：这显示了“几乎等于”的缺点之一：缩放会改变结果。下面的代码打印 1 和 0。

double x0 = 1.1;
double x1 = 1.1 + 3*0x1p-52;
std::cout << almostEqual(x0, x1) << "\n";
x0 *= .8;
x1 *= .8;
std::cout << almostEqual(x0, x1) << "\n";

另一个失败是它不是传递的。almostEqual(a, b)并不almostEqual(b, c)意味着almostEqual(a, c)。

极端情况下的错误

almostEqual(1.f, 1.f/11, 0x745d17)错误地返回 1。

1.f/11 是 0x1.745d18p-4（十六进制浮点表示法，表示 1.745d18 ₁₆ •2 ^-4）。从 1（即 0x10p-4）中减去它，得到 0xe.8ba2e8p-4。由于 1 的 ULP 是 0x1p-23，即 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP（移位 20 位并除以 2，净 19 位）= 0x745d17.4 ULP。这超出了 0x745d17 的指定容差，因此正确答案为 0。

此错误是由四舍五入引起的max_frac-scaled_min_frac。

摆脱这个问题的一个简单方法是指定ulps必须小于.5/limits::epsilon。然后max_frac-scaled_min_frac只有当差异（即使四舍五入）超过时才会发生舍入ulps；如果差值小于这个值，则减法是精确的，根据 Sterbenz 引理。

有一个关于使用long double来纠正这个问题的建议。但是，long double不会更正这一点。考虑将 1 和 -0x1p-149f 与ulps设置为进行比较1/limits::epsilon。除非您的有效位有 149 位，否则减法结果会舍入为 1，即小于或等于1/limits::epsilonULP。然而，数学上的差异显然超过了 1。

小注

该表达式factor * limits::epsilon / 2转换factor为浮点类型，这会导致factor无法精确表示的大值的舍入错误。很可能，例程不打算用于如此大的值（在中的数百万 ULP float），因此应该将其指定为例程的限制而不是错误。

score 2 · Accepted Answer

简化：您可以通过首先完全丢弃非有限情况来避免 my_freexp ：

if( ! std::isfinite(a) || ! std::isfinite(b) )
    return a == b;

似乎 isfinite 至少在 C++11 中

编辑但是，如果打算limits::infinity()在 1 ulp 之内，limits::max()
则上述简化不成立，但 my_frexp() 不应该返回limits::max_exponent+1in *exp，而不是 max_exponent+2 吗？

score 0 · Accepted Answer

未来证明：如果您将来想将此类比较扩展到十进制浮点http://en.wikipedia.org/wiki/Decimal64_floating-point_format，并假设 ldexp() 和 frexp() 将使用正确的基数处理此类类型，然后严格来说， 0.5 inreturn std::copysign(0.5, num);应该替换为T(1)/limits::radix()-std::ldexp(T(1),-1)或其他东西......（我在 std::numeric_limits 中找不到方便的常量）

编辑正如 Nemo 评论的那样，ldexp 和 frexp 将使用正确的 FLOAT_RADIX 的假设是错误的，他们坚持使用 2...

所以未来证明便携版本也应该使用：

std::scalbn(x,n)代替std::ldexp(x,n)
exp=std::ilogb(std::abs(x)),y=std::scalbn(x,-exp)代替y=frexp(x,&exp)
现在上面的 y in 是 [1,FLOAT_RADIX) 而不是 [T(1)/Float_Radix,1)，copysign(T(1),num)对于 my_frexp 的无限情况，返回而不是 0.5，并测试ulps*limits::epsilon()而不是 ulps*epsilon()/2

这也需要标准 >= C++11

score 0 · Accepted Answer

您的方法的两个缺点是您的实现依赖于几个函数，<cmath>这些函数显然只适用于标准中的浮点数，遗憾的是目前不能保证是constexpr（虽然使用 GCC constexpr，您的函数的一个版本将编译它不会与大多数其他编译器一起使用）。

Google 方法相对于您的方法的优势在于，在 C++20 中，您可以将GoogleTest 版本中的union(或 a ) 替换为function。此外，通过检查和使用新引入的方法，您应该能够确保假定的内部表示是正确的。通过引入两个自定义特征，您可以获得很大的灵活性，允许您将函数扩展到非标准浮动类型：reinterpret_caststd::bit_castconstexprstd::numeric_limits<T>::is_iec559std::endian

确定等长无符号整数的一个特征（在我的代码中称为）：通过这种方式，您可以通过添加专门化（例如并使用GCC或Boost MultiprecisionUIntEquiv ）将代码扩展为任意长度的浮点数。template<> class UIntEquiv<16>__uint128_t
另一个用于实际浮点类型（在我的代码中称为FloatTrait），它的符号、尾数和指数掩码，以及无穷大和 NaN 的表示。通过这种方式，您可以将比较扩展到非 C++ 标准浮点数，例如float256通过添加另一个专业化。

我从下面的Gist中插入了代码：

#include <bit>
#include <concepts>
#include <cstdint>
#include <limits>
#include <type_traits>


namespace detail {
  // Trait for excluding incomplete types: See https://stackoverflow.com/a/44229779/9938686
  template <typename T, std::size_t = sizeof(T)>
  std::true_type is_complete(T *);

  std::false_type is_complete(...);
}

template <typename T>
using is_complete = decltype(detail::is_complete(std::declval<T*>()));

template <typename T>
static constexpr bool is_complete_v = is_complete<T>::value;

namespace detail {

  // Class for determining the corresponding unsigned integer type with equal length
  template <std::size_t N>
  class UIntEquiv {
    protected:
      UIntEquiv() = delete;
      UIntEquiv(UIntEquiv const&) = delete;
      UIntEquiv(UIntEquiv&&) = delete;
      UIntEquiv& operator= (UIntEquiv const&) = delete;
      UIntEquiv& operator= (UIntEquiv&&) = delete;

      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint8_t))>* = nullptr>
      static constexpr std::uint8_t determineUIntEquivalent() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint16_t))>* = nullptr>
      static constexpr std::uint16_t determineUIntEquivalent() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint32_t))>* = nullptr>
      static constexpr std::uint32_t determineUIntEquivalent() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint64_t))>* = nullptr>
      static constexpr std::uint64_t determineUIntEquivalent() noexcept;

    public:
      using type = decltype(determineUIntEquivalent<N>());
  };
  
  // Potentially add specialisation of UIntEquiv for longer unsigned integer types here
  // e.g. GCC's __uint128_t: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html
  // or Boost: https://www.boost.org/doc/libs/1_67_0/libs/multiprecision/doc/html/boost_multiprecision/tut/ints/cpp_int.html

  template <std::size_t N>
  using UIntEquiv_t = typename UIntEquiv<N>::type;

  // You can provide specialisations of this trait for proprietary floating types
  // e.g. Boost float128: https://www.boost.org/doc/libs/1_65_1/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/float128.html)
  template <typename T>
  class FloatTrait;

  // Specialised trait for floating point number types according to IEEE754
  template <typename T>
  requires std::is_floating_point_v<T> && std::numeric_limits<T>::is_iec559 && (std::endian::native == std::endian::little)
  class FloatTrait<T> {
    public:
      static constexpr std::size_t number_of_bytes {sizeof(T)};
      static constexpr std::size_t number_of_bits {number_of_bytes*std::numeric_limits<std::uint8_t>::digits};
      using Bytes = UIntEquiv_t<number_of_bytes>;

      static constexpr std::size_t number_of_sign_bits {1};
      static constexpr std::size_t number_of_fraction_bits {std::numeric_limits<T>::digits-1};
      static constexpr std::size_t number_of_exponent_bits {number_of_bits - number_of_sign_bits - number_of_fraction_bits};

      static constexpr Bytes sign_mask {Bytes{1} << (number_of_bits - 1)};
      static constexpr Bytes fraction_mask {~Bytes{0} >> (number_of_exponent_bits + 1)};
      static constexpr Bytes exponent_mask {~(sign_mask | fraction_mask)};

      static constexpr bool isNan(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes != 0);
      }

      static constexpr bool isPosInf(T const t) noexcept {
        return isPos(t) && isInf(t);
      }

      static constexpr bool isNegInf(T const t) noexcept {
        return isNeg(t) && isInf(t);
      }

      static constexpr bool isNeg(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes != 0;
      }

      // Optional helper functions
      static constexpr bool isPos(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes == 0;
      }

      static constexpr bool isInf(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes == 0);
      }

      static constexpr Bytes extractSign(Bytes const bytes) noexcept {
        return bytes & sign_mask;
      }

      static constexpr Bytes extractExponent(Bytes const bytes) noexcept {
        return bytes & exponent_mask;
      }

      static constexpr Bytes extractFraction(Bytes const bytes) noexcept {
        return bytes & fraction_mask;
      }

    protected:
      FloatTrait() = delete;
      FloatTrait(FloatTrait const&) = delete;
      FloatTrait(FloatTrait&&) = delete;
      FloatTrait& operator= (FloatTrait const&) = delete;
      FloatTrait& operator= (FloatTrait&&) = delete;
  };

  template <typename T>
  requires is_complete_v<FloatTrait<T>> && is_complete_v<UIntEquiv_t<sizeof(T)>>
  class FloatView {
    public:
      using Trait = FloatTrait<T>;
      using Bytes = UIntEquiv_t<sizeof(T)>;

      explicit constexpr FloatView(T const t) noexcept
        : value{t} {
        return;
      }
      FloatView() = default;
      FloatView(FloatView const&) = default;
      FloatView(FloatView&&) = default;
      FloatView& operator= (FloatView const&) = default;
      FloatView& operator= (FloatView&&) = default;

      constexpr bool isNearlyEqual(FloatView const rhs, std::uint8_t const max_distance = 4) const noexcept {
        if (Trait::isNan(value) || Trait::isNan(rhs.value)) {
          return false;
        // Forces infinity to only be equal to inifinity and not also to very large numbers
        } else if (Trait::isNegInf(value) != Trait::isNegInf(rhs.value)) {
          return false;
        } else if (Trait::isPosInf(value) != Trait::isPosInf(rhs.value)) {
          return false;
        }
        return computeDistance(value, rhs.value) <= max_distance;
      }

    protected:
      T value;

      static constexpr Bytes signMagnitudeToBiased(T const t) noexcept {
        auto const b {std::bit_cast<Bytes>(t)};
        if (Trait::isNeg(t)) {
          return ~b + Bytes{1};
        } else {
          return Trait::sign_mask | b;
        }
      }

      static constexpr Bytes computeDistance(T const a, T const b) noexcept {
        auto const biased1 = signMagnitudeToBiased(a);
        auto const biased2 = signMagnitudeToBiased(b);
        return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
      }
  };

}

template <typename T>
constexpr bool isNearlyEqual(T const lhs, T const rhs, std::uint8_t const max_distance = 4) noexcept {
  detail::FloatView<T> const a {lhs};
  detail::FloatView<T> const b {rhs};
  return a.isNearlyEqual(b, max_distance);
}

在这里试试

c++ - 重新审视浮点比较

4 回答 4

“几乎等于”不是一个好函数

极端情况下的错误

小注

Related

Reference