x86 - Why are there more L1d cache misses when running on sibling threads than on independent threads when testing false sharing?

Question

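(I know some related questions have been asked in the past, but I couldn't find one about L1d cache misses and Hyper-Threading/SMT.)

After a few days of reading about some super interesting stuff like false sharing and the MESI/MOESI cache coherence protocols, I decided to write a small "benchmark" in C (see below) to test false sharing in action.

I basically have an array of 8 doubles, so that it fits in one cache line, and two threads incrementing adjacent array positions.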
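As a small sketch of the "fits in one cache line" claim: 8 doubles occupy 64 bytes, but they only share a single line if the array also starts on a 64-byte boundary. The snippet below assumes a 64-byte line size; `CACHE_LINE`, `same_cache_line`, `shared_line` and `two_lines` are illustrative names, not part of the benchmark.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size in bytes */

static_assert(sizeof(double) * 8 == CACHE_LINE,
              "8 doubles should fill exactly one assumed cache line");

/* Returns 1 if both addresses fall into the same CACHE_LINE-sized block. */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}

/* 8 doubles, forced to start at the beginning of a line. */
static alignas(CACHE_LINE) double shared_line[8];

/* 16 doubles spanning two lines, for contrast. */
static alignas(CACHE_LINE) double two_lines[16];
```

With this alignment, any two elements of `shared_line` are on the same line, while `two_lines[7]` and `two_lines[8]` are not. A plain `double array[8]` on the stack carries no such guarantee and may straddle two lines.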

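At this point I should state that I am using a Ryzen 5 3600, whose topology can be seen here.

I create two threads and then pin them on two different logical cores, and each one accesses and updates its own array position, e.g. Thread A updates array[2] and Thread B updates array[3] (the code below uses indices 0 and 1).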
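For reference, one way to pin a thread to exactly one logical CPU is to give it an affinity mask with a single bit set. This is only a sketch: `pin_self_to_cpu` and `first_allowed_cpu` are hypothetical helpers, and it uses `sched_setaffinity` with pid 0 (the calling thread); `pthread_setaffinity_np(pthread_self(), ...)` with the same one-CPU mask should behave equivalently inside a worker thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to exactly one logical CPU.
 * Returns 0 on success, -1 on failure (errno is set by sched_setaffinity). */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t one;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    return sched_setaffinity(0, sizeof(one), &one);  /* 0 = calling thread */
}

/* Lowest-numbered CPU the calling thread is currently allowed to run on,
 * or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}
```

In a two-thread benchmark, each worker would call `pin_self_to_cpu` with its own CPU number (for example 0 in one thread and 6 in the other) before entering the loop.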

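When I run the code using hardware threads #0 and #6, which belong to the same core and (as shown in the topology diagram) share the L1d cache, the execution time is about 5 seconds.

When I use threads #0 and #11, which have no caches in common, it takes about 9.5 seconds to complete. That time difference is expected, because in this case there is "cache line ping-pong" going on.

However, and this is what bugs me, when I use threads #0 and #11 there are fewer L1d cache misses than when running with threads #0 and #6.

My guess is that when using threads #0 and #11, which have no common caches, when one thread updates the contents of the shared cache line, then according to the MESI/MOESI protocol the cache line in the other core gets invalidated. So even though there is ping-pong going on, there are not that many cache misses (compared to running with threads #0 and #6), just a bunch of invalidations and cache line transfers between the cores.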
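To make that guess a bit more concrete, here is a toy model of a single cache line bouncing between two cores. It is a deliberately simplified illustration, not a description of how Zen 2 or any real CPU works: it tracks only MESI states, and it counts one "transfer" every time a core has to obtain ownership before writing. All names (`toy_line`, `toy_write`, `toy_pingpong`) are hypothetical.

```c
#include <assert.h>

enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct toy_line {
    enum mesi state[2];  /* per-core state of the one modelled line */
    long transfers;      /* times ownership had to move to the writer */
};

/* Core c writes the line. If it does not already own the line (M or E),
 * it must obtain ownership first, which invalidates the other copy. */
static void toy_write(struct toy_line *l, int c)
{
    assert(c == 0 || c == 1);
    if (l->state[c] != MODIFIED && l->state[c] != EXCLUSIVE) {
        l->transfers++;
        l->state[1 - c] = INVALID;
    }
    l->state[c] = MODIFIED;
}

/* Each core performs `burst` back-to-back writes before the other core
 * gets a turn, until both have done `writes_per_core` writes. */
static long toy_pingpong(long writes_per_core, long burst)
{
    struct toy_line l = { { INVALID, INVALID }, 0 };
    assert(burst > 0);
    for (long done = 0; done < writes_per_core; done += burst)
        for (int c = 0; c < 2; c++)
            for (long i = 0; i < burst && done + i < writes_per_core; i++)
                toy_write(&l, c);
    return l.transfers;
}
```

In this model, `toy_pingpong(1000, 1)` gives 2000 transfers, while `toy_pingpong(1000, 100)` gives only 20: the number of misses depends on how many writes a core completes while it holds the line, not on the total number of writes.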

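So, when using threads #0 and #6, which have a common L1d cache, why are there more cache misses?

(Threads #0 and #6 also have a common L2 cache, but I don't think it has any importance here, because when the cache line gets invalidated it has to be fetched either from main memory (MESI) or from another core's cache (MOESI), so it seems impossible for the L2 to even have the needed data, yet also be asked for it.)

Of course, when one thread writes into the L1d cache line, the line becomes "dirty", but why does that matter? Shouldn't the other thread residing on the same physical core have no problem reading the new "dirty" value?

TL;DR: When testing false sharing, L1d cache misses are approximately 3 times higher when using two sibling threads (threads that belong to the same physical core) than when using threads that belong to two different physical cores (2.34% vs. 0.75% miss rate, 396 million vs. 118 million misses in absolute numbers). Why is that happening?

(All statistics, such as L1d cache misses, were measured using the perf tool on Linux.)

Also, a minor secondary question: why are sibling threads paired with IDs 6 apart? That is, thread 0's sibling is thread 6, and thread i's sibling is thread i+6. Does that help in any way? I've noticed this on both Intel and AMD CPUs.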
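The sibling pairing can be checked without a topology diagram, since Linux exposes it through sysfs. The sketch below reads `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the helper name `read_thread_siblings` is made up for illustration, and the file may be absent in containers or on non-Linux systems, in which case the function returns -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies the sibling list of logical CPU `cpu` (a string such as a
 * comma-separated or dash-separated list of CPU numbers) into buf.
 * Returns 0 on success, -1 if the sysfs file cannot be read. */
static int read_thread_siblings(int cpu, char *buf, size_t buflen)
{
    char path[128];
    FILE *f;

    assert(buf != NULL && buflen > 0);
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
             cpu);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)buflen, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
    return 0;
}
```

Given the pairing described above, calling this for CPU 0 on that machine would be expected to report CPUs 0 and 6 as siblings.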

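I am super interested in computer architecture and I am still learning, so some of the above might be wrong; sorry for that.

So, this is my code. It just creates two threads, binds them to specific logical cores, and then accesses adjacent cache line locations.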

#define _GNU_SOURCE

#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

struct timespec tstart, tend;
static cpu_set_t cpuset;   // affinity mask shared by both worker threads


typedef struct arg_s
{
       int index;          // array element this thread increments
       double *array_ptr;  // start of the shared 8-double array
} arg_t;

void *work(void *arg)
{
    // Note: cpuset contains BOTH chosen CPUs, so each thread may run on
    // either of them; the scheduler decides which thread ends up where.
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    int array_index = ((arg_t*)arg)->index;
    double *ptr = ((arg_t*)arg)->array_ptr;

    for(unsigned long i=0; i<1000000000; i++)
    {
            // it doesn't matter which of these is used
            // as long as we are hitting adjacent positions
            ptr[array_index] ++;
            // ptr[array_index] += 1.0e5 * 4;
    }
    return NULL;
}

int main()
{
    pthread_t tid[2];

    // drand48() is seeded by srand48(), not srand()
    srand48(time(NULL));

    static int cpu0 = 0;
    static int cpu6 = 6; //change this to say 11 to run with threads 0 and 11

    CPU_ZERO(&cpuset);
    CPU_SET(cpu0, &cpuset);
    CPU_SET(cpu6, &cpuset);

    // 8 doubles = 64 bytes; a stack array is not guaranteed to be
    // 64-byte aligned, so it may not start at a cache line boundary
    double array[8];

    for(int i=0; i<8; i++)
            array[i] = drand48();

    arg_t *arg0 = malloc(sizeof(arg_t));
    arg_t *arg1 = malloc(sizeof(arg_t));
    if (arg0 == NULL || arg1 == NULL)
            return 1;

    arg0->index = 0; arg0->array_ptr = array;
    arg1->index = 1; arg1->array_ptr = array;

    // monotonic clock: unaffected by wall-clock adjustments
    clock_gettime(CLOCK_MONOTONIC, &tstart);

    pthread_create(&tid[0], NULL, work, (void*)arg0);
    pthread_create(&tid[1], NULL, work, (void*)arg1);

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    clock_gettime(CLOCK_MONOTONIC, &tend);

    double elapsed = (tend.tv_sec - tstart.tv_sec)
                   + (tend.tv_nsec - tstart.tv_nsec) / 1e9;
    printf("elapsed: %.3f s\n", elapsed);

    free(arg0);
    free(arg1);
    return 0;
 }

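I am using GCC 10.2.0, compiling with gcc -pthread p.c -o p

Then I run perf record ./p --cpu=0,6, or the same thing with --cpu=0,11, when using threads 0,6 and 0,11 respectively.

Then I run perf stat -d ./p --cpu=0,6, or the same with --cpu=0,11 for the other case.

Running with threads 0 and 6: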

Performance counter stats for './p --cpu=0,6':

           9437,29 msec task-clock                #    1,997 CPUs utilized          
                64      context-switches          #    0,007 K/sec                  
                 2      cpu-migrations            #    0,000 K/sec                  
               912      page-faults               #    0,097 K/sec                  
       39569031046      cycles                    #    4,193 GHz                      (75,00%)
        5925158870      stalled-cycles-frontend   #   14,97% frontend cycles idle     (75,00%)
        2300826705      stalled-cycles-backend    #    5,81% backend cycles idle      (75,00%)
       24052237511      instructions              #    0,61  insn per cycle         
                                                  #    0,25  stalled cycles per insn  (75,00%)
        2010923861      branches                  #  213,083 M/sec                    (75,00%)
            357725      branch-misses             #    0,02% of all branches          (75,03%)
       16930828846      L1-dcache-loads           # 1794,034 M/sec                    (74,99%)
         396121055      L1-dcache-load-misses     #    2,34% of all L1-dcache accesses  (74,96%)
   <not supported>     LLC-loads                                                   
   <not supported>     LLC-load-misses                                             

       4,725786281 seconds time elapsed

       9,429749000 seconds user
       0,000000000 seconds sys

Running with threads 0 and 11:

 Performance counter stats for './p --cpu=0,11':

          18693,31 msec task-clock                #    1,982 CPUs utilized          
               114      context-switches          #    0,006 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
               903      page-faults               #    0,048 K/sec                  
       78404951347      cycles                    #    4,194 GHz                      (74,97%)
        1763001213      stalled-cycles-frontend   #    2,25% frontend cycles idle     (74,98%)
       71054052070      stalled-cycles-backend    #   90,62% backend cycles idle      (74,98%)
       24055983565      instructions              #    0,31  insn per cycle         
                                                  #    2,95  stalled cycles per insn  (74,97%)
        2012326306      branches                  #  107,650 M/sec                    (74,96%)
            553278      branch-misses             #    0,03% of all branches          (75,07%)
       15715489973      L1-dcache-loads           #  840,701 M/sec                    (75,09%)
         118455010      L1-dcache-load-misses     #    0,75% of all L1-dcache accesses  (74,98%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

       9,430223356 seconds time elapsed

      18,675328000 seconds user
       0,000000000 seconds sys
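As a quick sanity check on the figures quoted in the TL;DR, the miss rates can be recomputed from the raw `L1-dcache-loads` and `L1-dcache-load-misses` counts in the two perf outputs. The helper name `miss_rate_percent` is only illustrative.

```c
#include <assert.h>

/* L1d miss rate in percent: misses / loads * 100. */
static double miss_rate_percent(double misses, double loads)
{
    assert(loads > 0.0);
    return 100.0 * misses / loads;
}

/* Raw counts copied from the perf stat outputs above. */
static const double MISSES_0_6  = 396121055.0,  LOADS_0_6  = 16930828846.0;
static const double MISSES_0_11 = 118455010.0,  LOADS_0_11 = 15715489973.0;
```

With these counts the rates come out at about 2.34% and 0.75%, matching what perf prints. The absolute miss counts differ by a factor of about 3.3, and the miss rates by a factor of about 3.1.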
