c - I do *not* want correct rounding for function exp

Question

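The GCC implementation of the C mathematical library on Debian systems apparently has an (IEEE 754-2008)-compliant implementation of the function exp, which implies that rounding shall always be correct:

(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply-add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions, and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.

It turns out that I am encountering a case where this feature is actually a hindrance, because the exact result of the exp function is often almost exactly halfway between two consecutive double values (1), and the program then carries out a great deal of further computation, losing up to a factor of 400 (!) in speed: this was actually the explanation for my (ill-asked :-S) Question #43530011.

(1) More precisely, this happens when the argument of exp turns out to be of the form (2 k + 1) × 2^-53, with k a rather small integer (like 242 for instance). In particular, the computations involved in pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^-44.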

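Since correct rounding can be very time-consuming in some circumstances, I guess that the developers have also devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time that is (roughly) bounded for every value of the argument in a given range... (2)

... But how to do this?

(2) What I mean is that I just do not want some exceptional values of the argument, like (2 k + 1) × 2^-53, to be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a longer computation time.

Here is a minimal program showing the phenomenon: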

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
 {
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
   {
    a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
    c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
   }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not guaranteed to be int, so cast before printing.
  return 0;
 }

Now, after gcc -std=c99 program53.c -lm -o program53:

$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53 
1.000000e+06
Clock time spent: 13292721
$ ./program53 
1.000000e+06
Clock time spent: 13201616

On the other hand, with program52 and program54 (obtained by replacing 0x20000000000000 with 0x10000000000000 and 0x40000000000000 respectively):

$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612

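Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those of the Debian systems (including Ubuntu) show this phenomenon.

P.-S.: I hope that my question is not a duplicate: I searched thoroughly for a similar question without success, but maybe I did not use the relevant keywords... :-/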

score 11 · Accepted Answer

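To answer the general question of why the library functions are required to give correctly rounded results:

Floating-point is hard, and often counterintuitive. Not every programmer has read what they should have. When the libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library functions whenever their own inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.

In many cases, specific knowledge about floating-point algorithms can produce considerable improvements in accuracy and/or performance, as in the testcase:

Taking exp() of numbers very close to 0 in floating-point is problematic, since the result is a number close to 1 while all the precision is in the difference from one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If exp() itself is really needed, it is still much faster to compute expm1(x) + 1.

A similar concern exists for computing log(1 + x), for which there is the function log1p(x).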

A quick fix that speeds up the provided testcase:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %ld\n", (long) (stop - start)); // clock_t is not guaranteed to be int, so cast before printing.
  return 0;
}

For this case, the timings on my machine are as follows:

Original code

1.000000e+06

Clock time spent: 21543338

Modified code

1.000000e+06

Clock time spent: 55076

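Programmers with advanced knowledge of the accompanying trade-offs can sometimes consider using approximate results where precision is not critical.

An experienced programmer may be able to write an approximate implementation of a slow function using methods like Newton-Raphson or Taylor/Maclaurin polynomials, use inexactly rounded specialty functions from libraries like Intel's MKL or AMD's ACML, relax the compiler's floating-point standard compliance, reduce precision to ieee754 binary32 (float), or combine several of these.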

Note that a better description of the problem could lead to a better answer.

score 1 · Accepted Answer

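Regarding your comment on @EOF's answer, the "write your own" remark from @NominalAnimal seems simple enough here, even trivial, as follows.

Your original code above seems to have a max possible argument for exp() of a=(1+2*0x3FF)/0x2000...≈2.27e-13 (the largest value of rand()%0x400 is 0x3FF, and I'm counting 13 zeroes after your 0x2000..., which makes the denominator 2x16^13). So that max argument is very, very small.

And then the trivial Taylor expansion is exp(a)=1+a+(a^2)/2+(a^3)/6+... which already gives you all of double's precision for such small args. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a)=a*(1.+a*(1.+a/3.)/2.) which should go pretty quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that?).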

>>EDIT<<

I modified the OP's test program as follows to test a little more stuff (discussion follows the code)

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 16               /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      multiplier = (argc>2?atoi(argv[2]):2),
      isexp      = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) c += exp(a); /* turn this off to time expm1() alone */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
         n, multiplier,BASE,EXPON,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  return 0;
  } /* --- end-of-function main() --- */

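Compile and run it as test to reproduce the OP's 0x2000... scenario, or run it with (up to three) optional args test #trials multiplier timeexp, where #trials defaults to the OP's 1000000, and multiplier defaults to 2 for the OP's 2x16^13 (change it to 4, etc, for her other tests). For the last arg, timeexp, enter 0 to do only the expm1() (and my unnecessary taylor-like) calculation. The point of that is to show that the bad-timing cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.

So the default runs, test and test 1000000 4, produce (okay, I called the program rounding)...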

bash-4.3$ ./rounding 
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
         c=1.00000000000000011642e+06,
         c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08

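So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my taylor approx. If you reduce N they come into agreement, as follows...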

bash-4.3$ ./rounding 10
N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
         c=1.00000000000007140954e+01,
         c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
         c=1.00000000000010103918e+02,
         c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
         c=1.00000000000011345946e+03,
         c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
         c=1.00000000000002328306e+04,
         c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
         c=1.00000000000000232831e+05,
         c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08

As for the timing of exp() versus expm1(), see for yourself...

bash-4.3$ ./rounding 1000000 2  
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
         c=0.00000000000000000000e+00,
         c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07

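Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening.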
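A likely explanation, offered as an assumption rather than something verified against the runs above, is absorption in the running sum. Each exp(a) here is 1 plus a part of at most 2^-42. Once c passes 4096, adjacent doubles near c are 2^-40 apart, so that small part is below half a spacing and c += exp(a) rounds to exactly c + 1. Most of the loss already starts around c = 2048, and the frozen value 2.328306e-10 equals 2^-32, consistent with the excess having stopped growing early on. A minimal sketch (tiny_part_absorbed is an illustrative name):

```c
/* Returns 1 if adding (1 + tiny) to the running sum c loses tiny
   entirely, i.e. the rounded result equals c + 1. Intended for tiny
   values that are multiples of 2^-52, so that 1 + tiny is exact. */
int tiny_part_absorbed(double c, double tiny)
{
    double term = 1.0 + tiny;   /* one summand, like exp(a) for tiny a */
    double next = c + term;     /* what c += exp(a) would store        */
    return next == c + 1.0;
}
```

With tiny = 2^-43, the small part survives while c is 16 but is absorbed once c is 4096, which matches the way c-n stops changing between N=1000 and N=10000.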

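>>__SECOND EDIT__<<

Okay, @EOF, "you made me look" with your "hierarchical accumulation" comment. And that does indeed work to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code is immediately below, followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its place is expon, so that the denominator is now 2^expon where the default is 53, matching the OP's default (and I believe better matching how she was thinking about it). Okay, and here's the code...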

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 2                /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      expon      = (argc>2?atoi(argv[2]):EXPON),
      isexp      = (argc>3?atoi(argv[3]):1), /* flags to turn on/off exp() */
      ncparts    = (argc>4?atoi(argv[4]):1), /* #partial sums for c */
      binsize    = (argc>5?atoi(argv[5]):10);/* #doubles to sum in each bin */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = pow((double)BASE,(double)expon);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  double csums[10]={0.0}, cbins[10][65537]; /* c partial sums and hierarchy (csums must start at 0: it is only ever += below) */
  int nbins[10], ibin=0;      /* start at lowest level */
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  if ( ncparts > 65536 ) ncparts=65536;  /* array size check */
  if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) {            /* turn this off to time expm1() alone */
      double expa = exp(a);   /* exp(a) */
      c += expa;              /* just accumulate in a single "bin" */
      if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  if ( ncparts > 1 ) {        /* need to sum the partial-sum bins */
    nbins[ibin=0] = ncparts;  /* lowest-level has everything */
    while ( nbins[ibin] > binsize ) { /* need another heirarchy level */
      if ( ibin >= 9 ) break; /* no more bins */
      ibin++;                 /* next available heirarchy bin level */
      nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
      for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0; /* init bins */
      for(i=0;i<nbins[ibin-1];i++) {
        cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
        csums[ibin-1] += cbins[ibin-1][i]; } /* accumulate in "one bin" */
      } /* --- end-of-while(nprevbins>binsize) --- */
    for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
    } /* --- end-of-if(ncparts>1) --- */
  printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
    for (i=0;i<=ibin;i++) /* display heirarchy */
      printf("\t level#%d: #bins=%5d, c-n=%e\n",
      i,nbins[i],csums[i]-(double)n); }
  return 0;
  } /* --- end-of-function main() --- */

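Okay, so now you can note two additional command-line args following the old timeexp. The first is ncparts, the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the hierarchy, each bin should (modulo bugs:) hold the sum of #trials/ncparts doubles. The argument after that is binsize, which is the number of doubles summed into each bin at every successive level, until the last level has fewer (or equal) #bins than binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20 doubles/bin at the lowest level, and 5 doubles/bin thereafter...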

bash-4.3$ ./rounding 1000000 53 1 50000 5 
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
         c=1.00000000000000465661e+06,
         c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins=50000, c-n=4.656613e-09
         level#1: #bins=10002, c-n=1.734588e-08
         level#2: #bins= 2002, c-n=7.974450e-08
         level#3: #bins=  402, c-n=1.059379e-07
         level#4: #bins=   82, c-n=1.133885e-07
         level#5: #bins=   18, c-n=1.136214e-07
         level#6: #bins=    5, c-n=1.138542e-07

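Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note that if you break the #trials into only 5000 initial bins, you get just as good a result,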

bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
         c=1.00000000000003527384e+06,
         c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins= 5000, c-n=3.527384e-08
         level#1: #bins= 1002, c-n=1.164153e-07
         level#2: #bins=  202, c-n=1.158332e-07
         level#3: #bins=   42, c-n=1.136214e-07
         level#4: #bins=   10, c-n=1.137378e-07
         level#5: #bins=    4, c-n=1.136214e-07

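In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for @EOF...???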
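For comparison, a different technique from the binning used above (not something this answer or @EOF proposed) is Kahan compensated summation, which carries the rounding error of each addition forward in a second variable. A minimal sketch, assuming the compiler is not allowed to reassociate floating-point operations (so no -ffast-math); the function names are illustrative:

```c
#include <stddef.h>

/* Plain left-to-right summation, for reference. */
double plain_sum(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

/* Kahan (compensated) summation: comp tracks the low-order bits that the
   previous addition rounded away and feeds them back into the next one. */
double kahan_sum(const double *v, size_t n)
{
    double sum = 0.0, comp = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double y = v[i] - comp;   /* apply the stored correction          */
        double t = sum + y;       /* big + small: low bits of y are lost  */
        comp = (t - sum) - y;     /* recover (negated) what was just lost */
        sum = t;
    }
    return sum;
}
```

Unlike the bin hierarchy, this needs no extra arrays and no tuning of ncparts or binsize, at the cost of a few more floating-point operations per element.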

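>>EDIT: example showing the pair-addition "binary tree" hierarchy<<

Example below added as per @EOF's comment (Note: re-copy the preceding code. I had to edit the nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative" to create the ...16,8,4,2 sequence)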

bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
         c=1.02400000000011573320e+03,
         c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
         binsize=2...
         level#0: #bins=  512, c-n=1.159606e-10
         level#1: #bins=  256, c-n=1.166427e-10
         level#2: #bins=  128, c-n=1.166427e-10
         level#3: #bins=   64, c-n=1.161879e-10
         level#4: #bins=   32, c-n=1.166427e-10
         level#5: #bins=   16, c-n=1.166427e-10
         level#6: #bins=    8, c-n=1.166427e-10
         level#7: #bins=    4, c-n=1.166427e-10
         level#8: #bins=    2, c-n=1.164153e-10

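>>EDIT: to show @EOF's elegant solution from the comment below<<

"Pair addition" can be elegantly accomplished recursively, as per @EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)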

  /* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left.
   (Note that this will require n/2 elements of storage,
   rather than a fixed number of bins like your implementation) */
  double trecu(double *vals, double sum, int n) {
      int midn = n/2;
      switch (n) {
        case  0: break;
        case  1: sum += *vals; break;
        default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
      return(sum);
      }

score 1 · Accepted Answer

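This is an "answer"/followup to EOF's preceding comments about his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all this in one organized place, but I haven't done that yet...

...What I did do was build EOF's trecu() into the test program from the preceding answer, which I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.

But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. The long and short of it is that, below, the same thing happens: bintreesum() remains close to the correct answer, while trecu() gets further away, exactly reproducing the "plain sum".

What we're summing below is just sum(i),i=1...n, which is just the well-known n(n+1)/2. But that's not quite right: to reproduce the OP's problem, the summand isn't i alone but rather 1+i*10^e, where e (negative, default -10) can be given on the command line. So for, say, n=5, you're not getting 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...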

/* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>

double trecu(double *vals, double sum, int n) {
  int midn = n/2;
  switch (n) {
    case  0: break;
    case  1: sum += *vals; break;
    default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
  return(sum);
  } /* --- end-of-function trecu() --- */

double bintreesum(double *vals, int n, int binsize) {
  double binsum = 0.0;
  int nbin0 = (n+(binsize-1))/binsize,
      nbin1 = (nbin0+(binsize-1))/binsize,
      nbins[2] = { nbin0, nbin1 };
  double *vbins[2] = {
            (double *)malloc(nbin0*sizeof(double)),
            (double *)malloc(nbin1*sizeof(double)) },
         *vbin0=vbins[0], *vbin1=vbins[1];
  int ibin=0, i;
  for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
  for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
  while ( nbins[ibin] > 1 ) {
    int jbin = 1-ibin;        /* other bin, 0<-->1 */
    nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
    for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
    for ( i=0; i<nbins[ibin]; i++ )
      vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
    ibin = jbin;              /* swap bins for next pass */
    } /* --- end-of-while(nbins[ibin]>1) --- */
  binsum = vbins[ibin][0];
  free((void *)vbins[0]);  free((void *)vbins[1]);
  return ( binsum );
  } /* --- end-of-function bintreesum() --- */

#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
  int N       = (argc>1? atoi(argv[1]) : 1000000 ),
      e       = (argc>2? atoi(argv[2]) : -10 ),
      binsize = (argc>3? atoi(argv[3]) : 2 );
  double tens = pow(10.0,(double)e);
  double *vals = (double *)malloc(sizeof(double)*MAXN),
         sum = 0.0;
  int i;                      /* trecu() and bintreesum() are already defined above */
  if ( N > MAXN ) N=MAXN;
  for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
  for ( i=0; i<N; i++ ) sum += vals[i];
  printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N  =  %.8e,\n"
         "\t plain_sum-N  = %.8e,\n"
         "\t trecu-N      = %.8e,\n"
         "\t bintreesum-N = %.8e \n",
         N, tens, tens*((double)N)*((double)(N+1))/2.0,
          sum-(double)N,
         trecu(vals,0.0,N)-(double)N,
         bintreesum(vals,N,binsize)-(double)N );
  free(vals);
  return 0;
  } /* --- end-of-function main() --- */
#endif

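So if you save that as trecu.c, then compile it as cc -DTESTTRECU trecu.c -lm -o trecu and then run it with zero to three optional command-line args as trecu #trials e binsize. Defaults are #trials=1000000 (like the OP's program), e=-10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than larger-size bins).

And here are some test results illustrating the problem described above,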

bash-4.3$ ./trecu              
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N  =  5.00000500e+01,
         plain_sum-N  = 5.00000500e+01,
         trecu-N      = 5.00000500e+01,
         bintreesum-N = 5.00000500e+01 
bash-4.3$ ./trecu 1000000 -15
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N  =  5.00000500e-04,
         plain_sum-N  = 5.01087168e-04,
         trecu-N      = 5.01087168e-04,
         bintreesum-N = 5.00000548e-04 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -16
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N  =  5.00000500e-05,
         plain_sum-N  = 6.67552231e-05,
         trecu-N      = 6.67552231e-05,
         bintreesum-N = 5.00001479e-05 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -17
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N  =  5.00000500e-06,
         plain_sum-N  = 0.00000000e+00,
         trecu-N      = 0.00000000e+00,
         bintreesum-N = 4.99992166e-06

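So you can see that for the default run, e=-10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=-10 test case. But for the e=-15 and e=-16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=-17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.

So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kinds of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.

Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.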
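One reading of why trecu() reproduces the plain sum (an analysis of the code as posted, not something taken from the thread): it threads the running total through the recursion as the sum argument, so every leaf executes sum += *vals on the total accumulated so far, in left-to-right order. That is the same sequence of additions as the plain loop. A sketch of a variant in which each half is summed independently and only the two half-sums are added (pairwise_sum is an illustrative name):

```c
#include <stddef.h>

/* Pairwise (binary-tree) summation: sum each half on its own, then add
   the two partial sums. Operands of similar size get combined, so small
   contributions are not absorbed by one large running total. */
double pairwise_sum(const double *vals, size_t n)
{
    if (n == 0) return 0.0;
    if (n == 1) return vals[0];
    size_t mid = n / 2;
    return pairwise_sum(vals, mid) + pairwise_sum(vals + mid, n - mid);
}
```

The recursion depth is only about log2(n), and no temporary storage is needed; a call such as pairwise_sum(vals, N) could stand in for trecu(vals, 0.0, N) in the test program above.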
