c - 一起找到最小值和最大值：算法应该更快但不是

Question

我正在尝试实现一种算法来查找文件中一组 long 中的最小值和最大值。我的测试文件包含十亿长。

该算法按预期工作，但执行速度不比原始版本快。它应该明显更快，因为原始版本执行大约 2n 次比较，而此版本执行 3n/2 次比较。

$ time ./findminmax_naive somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m24.156s
user    0m4.956s
sys     0m3.896s

$ time ./findminmax_faster somelongs 
count: 1000000000
min: 0
max: 2147483647

real    0m25.048s
user    0m6.948s
sys     0m3.980s

这是“天真的”版本：

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
            perror("fopen");
        readcount = 1024;
        if (ac == 3)
            readcount = atol(av[2]);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i++) {
                        count++;
                        if (n[i] < min)
                            min = n[i];
                        if (n[i] > max)
                            max = n[i];
                }
                if (feof(f))
                        break;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

这是（应该）“更快”版本的代码：

#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int ac, char *av[])
{
        FILE *f;
        long count, readcount, i, min, max;
        size_t rlen;
        long *n;

        if (ac != 2 && ac != 3) {
                fprintf(stderr, "Usage: %s <file> [readcount]\n", av[0]);
                exit(1);
        }

        f = fopen(av[1], "r");
        if (f == NULL)
                perror("fopen");
        readcount = 1024;
        if (ac == 3)
                readcount = atol(av[2]);
        readcount = (readcount + 1) & (-1 << 1);
        n = alloca(sizeof (long) * readcount);
        rlen = fread(n, sizeof (*n), 1, f);
        min = max = n[0];
        count = 1;
        while (1) {
                rlen = fread(n, sizeof (*n), readcount, f);
                for (i = 0; i < (long)rlen; i += 2) {
                        count += 2;
                        if (n[i] < n[i + 1]) {
                                if (n[i] < min)
                                        min = n[i];
                                if (n[i + 1] > max)
                                        max = n[i + 1];
                        } else {
                                if (n[i + 1] < min)
                                        min = n[i + 1];
                                if (n[i] > max)
                                        max = n[i];
                        }
                }
                if (feof(f))
                        break;
        }
        if (rlen % 2) {
                if (n[rlen - 1] < min)
                        min = n[rlen - 1];
                if (n[rlen - 1] > max)
                        max = n[rlen - 1];
                count++;
        }
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

你知道我错过了什么吗？

谢谢你的帮助。

——杰瑞米

score 4 · Accepted Answer

关键是分支预测。除非文件以病态的最坏情况顺序排序，否则天真的版本将执行几乎每次都正确预测的 2n 个分支。您的“聪明”版本执行几乎从未正确预测的 n/2 个分支，以及可能正确预测的额外 n 个比较。

错误预测的分支成本在很大程度上取决于 cpu 架构甚至特定的 cpu 模型，但至少我预计错误预测的分支的成本是正确预测的分支的几倍。在极端情况下，正确预测的分支可能具有零周期的有效成本。

作为一个有趣的例子，我最近对优化进行了实验strlen，发现孤立地进行非常幼稚的展开strlen——一次比较和分支一个字节——比聪明的矢量化方法更快。这几乎可以肯定是因为strlen具有一个特殊属性，即直到最后一个分支之前的每个分支都将始终被正确预测。

顺便说一句，为了检验我的假设，试试这个输入模式：

999999999, 1000000001, 999999998, 1000000002, 999999997, 1000000003, ...

它将为幼稚算法提供最坏情况的分支预测，并为您的聪明版本的外部条件提供最佳情况。

score 1 · Accepted Answer

正如@chr 所说，“文件 I/O 将使算法本身的任何优化相形见绌”。

此外，较少的比较并不等于较少的运行时间消耗。这两种算法的时间复杂度为 O(n)，忽略了实际语句成本和抽象成本。

例如，作为这两种算法的两个粗略框架，时间消耗就是你程序中所有语句的时间成本。

例如：

//max and min initlaized as 0.
//c1,... reprents the time cost of each instruction.
while(i<count) {//c1
    if(a[i]>max)  //c2
        max =  a[i]; //c3
    i++;    //c4
}
//search of min is like below

时间成本：

T1 = 2n*c1 + 2n*c2 + x*c3 + y*c3 + 2n*c4 = 2n * (c1+c2+c4) +(x+y)*c3

其中 x 和 y 取决于您的数据的顺序。

并且，(3/2)n 的比较，

while(i<count)  //c1 
    if(a[i]<a[i+1]) {//c5
        if(a[i]<min) //c2
            min = a[i]; //c3
        if(a[i+1>max]) //c2
            max = a[i+1]; //c3
    }
    else
        ...
        //same as below,that swap i and i+1
    i+=2; //c6
}

时间成本：

T2 = n*c1 + n*c5 + n*2*c2 + (x'+y')*c3 +n*c6 = n*(c1+c5+c6) + 2n*c2 + (x'+y' )*c3

如果最大值和最小值是数据的前两个元素，x=x'=1;y=y'=1。

T1-T2 = n*c1 + 2n*c4 - n*c5 -n*c6。不同的编译器，T1-T2 可能不同。

更复杂的是 x,y,x',y' 是可变的，但我不会对此进行进一步讨论。我的目的是展示真正的运行时间成本。

如果你读了上面的内容后仍然感到困惑，你应该阅读算法介绍的第 2.2 章。

score 0 · Accepted Answer

首先，请原谅我在一个答案中回答所有问题。我知道我不应该在 stackoverflow.com 上这样做，但考虑到不同的主题或多或少交织在一起，这样会更容易。

介绍

所以，这是我现在用来测试算法的代码。与之前版本的区别：

它包括您通过命令行参数选择的两个版本；
@chr，@Dukeling：它 mmap(2)s 文件以防止系统调用或库调用；
@chr，@Dukeling：它有一个可选的“预热”选项，可以在运行所选算法之前将所有页面故障到内存中；
@Dukeling：程序将使用 gettimeofday(2)仅记录算法本身所花费的时间。

这是代码：

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ROUNDUP(x, y)   ((((x) + ((y) - 1)) / (y)) * (y))
#define USCARRY(x)      ((x) < 0 ? (x) + 1000000 : (x))

int
main(int ac, char *av[])
{
        struct stat st;
        struct timeval start, end;
        long count, min, max, pgsiz;
        long *n, *p, *endp;
        int fd, warmup;

        if (ac != 3 && ac != 4) {


              fprintf(stderr, "Usage: %s <\"trivial\"|\"faster\"> "
                    "<file> [warmup]\n", av[0]);
                exit(1);
        }

        fd = open(av[2], O_RDONLY);
        if (fd == -1)
                perror("open");
        fstat(fd, &st);
        pgsiz = sysconf(_SC_PAGESIZE);
        n = mmap(NULL, ROUNDUP(st.st_size, pgsiz), PROT_READ,
            MAP_SHARED, fd, 0);
        if (n == MAP_FAILED)
                perror("mmap");
        warmup = 0;
        if (ac == 4)
                warmup = atoi(av[3]);
        // warm up the filesystem cache
        count = st.st_size / sizeof (*n);
        endp = &n[count - 1];
        //printf("%zu\n", sizeof (*p));
        //printf("%zu\n", sizeof (count));
        //exit(0);
        if (warmup) {
                for (p = n; p <= endp; p++) {
                        fwrite(p, sizeof (*p), 1, stdout);
                        min = *p;
                }
        }
        // start algorithm
        gettimeofday(&start, NULL);
        if (!strcmp(av[1], "trivial")) {
                min = max = n[0];
                p = &n[1];
                while (p <= endp) {                     // c1 * n
                        if (p[0] < min)                 // c2 * n
                                min = p[0];             // c3 * x
                        if (p[0] > max)                 // c2 * n
                                max = p[0];             // c3 * y
                        p++;                            // c4 * n
                }                       
        } else if (!strcmp(av[1], "faster")) {
                min = max = n[0];
                p = &n[1];
                while (p < endp) {                      // c1 * n/2
                        if (p[0] < p[1]) {              // c2 * n/2
                                if (p[0] < min)         // c2 * n/4
                                        min = p[0];     // c3 * x/2
                                if (p[1] > max)         // c2 * n/4
                                        max = p[1];     // c3 * y/2
                        } else {
                                if (p[1] < min)         // c2 * n/4
                                        min = p[1];     // c3 * x/2
                                if (p[0] > max)         // c2 * n/4
                                        max = p[0];     // c3 * y/2
                        }
                        p += 2;                         // c5 * n
                }                                       
                if (p == endp) {
                        if (*endp < min)
                                min = *endp;
                        else if (*endp > max)
                                max = *endp;
                }
        } else {
                printf("sorry\n");
                exit(1);
        }
        gettimeofday(&end, NULL);
        printf("time: %ld.%ld\n", end.tv_sec - start.tv_sec,
            USCARRY(end.tv_usec - start.tv_usec));
        printf("count: %ld\n", count);
        printf("min: %ld\n", min);
        printf("max: %ld\n", max);
        exit(0);
}

测试用例

以下是我用于测试用例的文件：

$ ls -l _*
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:37 _bestcase
-rw-r--r-- 1 jlh jlh 2400000000 May 27 08:40 _random
-rw-r--r-- 1 jlh jlh 2400000000 May 27 23:38 _worstcase

$ od -N 64 -t dL _bestcase 
0000000                    0            300000000
0000020                    1            299999999
0000040                    2            299999998
0000060                    3            299999997
0000100

$ od -N 64 -t dL _random
0000000  8600270969454287374  8436406000964321037
0000020  7348498334162074664  2050332610730417325
0000040  8183771519386464726  4134520779774161130
0000060  2045475555785071180  2859007060406233926
0000100

$ od -N 64 -t dL _worstcase 
0000000            150000000            150000000
0000020            149999999            150000001
0000040            149999998            150000002
0000060            149999997            150000003
0000100

I/O 惩罚

好的，首先让我们预热缓存并验证没有丢失页面，然后可能会搞砸结果：

$ ./findminmax trivial _random 
time: 3.543573
count: 300000000
min: 31499144704
max: 9223372004409096866

$ ./findminmax trivial _random 
time: 1.466323
count: 300000000
min: 31499144704
max: 9223372004409096866

$ perf stat -e  minor-faults,major-faults ./findminmax trivial _random 
time: 1.284729
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax trivial _random':

           586,066 minor-faults                                                
                 0 major-faults                                                

       1.350118552 seconds time elapsed

如您所见，没有严重的页面错误。从现在开始，我们可以理所当然地认为不会有 I/O 影响。2.指令数

指令计数和分支未命中

@H2CO3，@vvy，您对其他指令也很重要的事实是绝对正确的（我会考虑每个操作在这里占用相同数量的 CPU 周期，并且会导致 appart CPU 缓存未命中）。我不知道为什么到目前为止我读到的关于算法的文献只关注比较的次数（好吧，我承认我读的不多；））。

我已经用我自己对平均情况的估计来评论循环中的每个步骤（我认为最坏的情况在这里并不有趣），这对你来说略有不同。

如果我没记错的话： - 对于普通版本：n * (c1 + 2*c2 + c4) + (x + y) * c3 - 对于更快的版本：n/2 * (c1 + 3*c2 + c5 ) + (x + y) * c3

现在根据我的理解，很难进一步推测每个 cN 需要多少 CPU 周期，因为它因 CPU 而异。

让我们检查一下我的计算机上发生了多少指令、分支和分支未命中，大部分是空闲的，而每个算法在每个测试用例上执行，并带有一个暖缓存（注意我每个用例测试了 3 次以验证没有明显偏差）：

随机案例

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _random
time: 1.547087
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow trivial _random':

     1,083,101,126 branches                                                    
            52,388 branch-miss

     4,335,175,257 instructions              #    0.00  insns per cycle        

       1.623851849 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _random
time: 2.748967
count: 300000000
min: 31499144704
max: 9223372004409096866

 Performance counter stats for './findminmax_stackoverflow faster _random':

       783,120,927 branches                                                    
        75,063,008 branch-miss

     3,735,286,264 instructions              #    0.00  insns per cycle        

       1.824884443 seconds time elapsed

请注意，对于更快的版本，我们的指令更少，但运行时间更长，退出可能是因为有更多的分支未命中，数量级或数量级！

最佳案例

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _bestcase 
time: 1.267697
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow trivial _bestcase':

     1,082,801,759 branches                                                    
            49,802 branch-miss

     4,334,200,448 instructions              #    0.00  insns per cycle        

       1.343425753 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _bestcase 
time: 0.957440
count: 300000000
min: 0
max: 300000000

 Performance counter stats for './findminmax_stackoverflow faster _bestcase':

       782,844,232 branches                                                    
            49,768 branch-miss

     3,734,103,167 instructions              #    0.00  insns per cycle        

       1.035189822 seconds time elapsed

最坏的情况下

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _worstcase 
time: 7.860047
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow trivial _worstcase':

     1,490,947,270 branches                                                    
         2,127,876 branch-miss

     7,159,600,607 instructions              #    0.00  insns per cycle        

       6.916856158 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _worstcase 
time: 7.616476
count: 300000000
min: 1
max: 299999999

 Performance counter stats for './findminmax_stackoverflow faster _worstcase':

     1,196,744,455 branches                                                    
         2,024,250 branch-miss

     6,594,182,002 instructions              #    0.00  insns per cycle        

       6.675068846 seconds time elapsed

因此，非常有趣的是，“随机”情况实际上比最坏情况更快，这并没有太大区别。我看到的唯一区别是我最坏的情况包含“小”数字（可以编码为 32 位），而随机情况包含真正的 64 位数字。

“小”整数的随机情况

因此，让我们尝试使用一组“小”随机数（仍以 64 位编码存储）：

$ od -N 64 -t dL _randomsmall 
0000000           1418331637           2076047555
0000020             22077878           1677970822
0000040           1845481621            609558726
0000060           1668260452            335112094
0000100

$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow trivial _randomsmall 
time: 7.682443
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow trivial _randomsmall':

     1,481,062,942 branches                                                    
         2,564,853 branch-miss

     6,223,311,378 instructions              #    0.00  insns per cycle        

       6.739897078 seconds time elapsed


$ perf stat -e branches,branch-miss,instructions ./findminmax_stackoverflow faster _randomsmall 
time: 7.772994
count: 300000000
min: 9
max: 2147483647

 Performance counter stats for './findminmax_stackoverflow faster _randomsmall':

     1,177,042,675 branches                                                    
        77,686,346 branch-miss

     5,607,194,799 instructions              #    0.00  insns per cycle        

       6.834074812 seconds time elapsed

所以，正如我猜想的那样，处理小数字实际上比处理大数字慢，即使它们都包含 64 位字。有更多的“小”数字分支未命中，原因可能只有 CPU 设计人员才能说出来:-)。

另一件值得注意的事情是，由 perf(1) 测量的经过时间通常有时小于由程序本身测量的时间。我认为这是因为程序本身使用实时，而 perf(1) 使用进程时间（进程实际运行的时间）。我尝试使用 getrusage(2)，但我到达这里的时间不匹配（例如，我得到 1.6s 作为用户时间，1.4s 作为系统时间，而 perf(1) 测量为 6.8s）。

结论

两种算法在执行时间方面没有太大的实际差异，尽管微不足道的缓存未命中率（数量级）比“更快”的算法少得多，但这似乎可以通过增加的指令数量来平衡（ 10-20%）；
更具体地说，这种缓存未命中差异只能在随机情况下看到：最好和最坏的情况似乎在这方面退化了，因为它们导致两种算法的缓存未命中数量相同；因此，在这些情况下，“更快”的算法确实比普通的算法快一点；在“常见”随机情况下，平凡算法要快一点；
小整数会在我的 CPU 上产生更多的缓存未命中；
最坏情况和小 int 的随机测试用例之间没有真正的区别。

所以，如果你能做到这一点，谢谢你:)。我很抱歉，虽然这所有的实验只导致了模糊的结论。希望开明的读者能对此有所了解:)。

——杰瑞米

score 0 · Accepted Answer

我能想到两件事：

它正在与分支预测器玩地狱。
天真的版本正在被编译器自动矢量化，而“聪明”的版本则不是。