3

我正在尝试使用 C 程序来理解 CPU 缓存和缓存行,就像我对大多数 C 概念所做的那样。我使用的程序如下所示。我从博客中得到了这个想法。

http://igoro.com/archive/gallery-of-processor-cache-effects/

现在我机器上以下程序的输出如下所示。这是 CFLAGS="-g -O0 -Wall" 的输出。

./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

优化后的输出 ( CFLAGS=-g -O3 -Wall)

CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

博客中指出

第一个循环将数组中的每个值乘以 3,第二个循环仅每 16 次乘以 >。第二个循环只完成了第一个循环的 6% 左右的工作,但在现代机器上,两个 for 循环所用的时间大致相同:在我的机器上分别为 80 毫秒和 78 毫秒。

在我的机器上似乎不是这种情况。如您所见,执行的时间

loop 1 is 0.46 seconds.

那对于

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

对于不同的 j 值。

为什么会这样?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
     It lives in a fixed-size buffer, set by the program executable's configuration according to the
     operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;

        /* CPU clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    return 0;
}
4

1 回答 1

3

我稍微修改了代码。首先是修改总结:

  1. 使 MAX_SIZE 显着变大,以确保在事情发生变化时存在真正的差异。(它现在确实使用了 2 GB 的内存,所以不要在 32 位操作系统上这样做)
  2. 运行循环 1 几次(在我的机器上,这会有所不同,因为我的机器第一次运行速度会变慢 - 这可能是因为malloc实际上并没有将内存映射到进程的地址空间,所以在第一个循环,我们得到了一些额外的内存映射开销)。它还确保 CPU 在执行其他循环时以“全速”运行,而不是“省电”速度。
  3. j通过乘以 2 在第二个循环中更快地改变值(与本例<<= 1相同*= 2- 使用 shift 的旧习惯)
  4. 使用+= 3而不是*= 3. (乘法比 += 慢一点,但在这种情况下差别不大。
  5. 添加一个loop3与 loop2 执行完全相同数量的操作,但在较小的内存范围内[使用&2 n -1 值来限制范围]。

我使用 4.6.3 版本编译了代码gcc -Wall -O3 -sdc=c99,并在四核 Athlon 965、Fedora Core 16 x86-64 和 16 GB RAM 上运行。

这是代码:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */

    for(int k = 0; k < 3; k++)
    {
        start = clock();

        /* Loop 1 */
        for (i = 0; i < MAX_SIZE; i++)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
    }

    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }


    // Third loop, performing the same operations as loop 2,
    // but only touching 16KB of memory
    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 3 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i & 0xfff] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}

结果:

CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.

正如你所看到的,前几个loop2需要相同的时间 - 一旦我们达到 32 时间开始下降,因为处理器不需要每个缓存线,但在这种loop3情况下,操作的数量在每个循环中相当直接地影响总时间。

编辑:

乘法 ( *=3) 与加法 ( +=3) 并没有太大的区别,除了在 loop3 的情况下,它增加了大约 30% 的循环时间。

于 2013-06-16T23:53:22.003 回答