c - False sharing and pthreads

Question

I have the following task: demonstrate false sharing. I wrote a simple program:

#include <sys/times.h>
#include <time.h>
#include <stdio.h> 
#include <pthread.h> 

long long int tmsBegin1,tmsEnd1,tmsBegin2,tmsEnd2,tmsBegin3,tmsEnd3;

int array[100];

void *heavy_loop(void *param) {
  int   index = *((int*)param);
  int   i;
  for (i = 0; i < 100000000; i++)
    array[index]+=3;
  return NULL;  /* a pthread start routine must return a void * */
}

int main(int argc, char *argv[]) { 
  int       first_elem  = 0;
  int       bad_elem    = 1;
  int       good_elem   = 32;
  long long time1;
  long long time2;
  long long time3;
  pthread_t     thread_1;
  pthread_t     thread_2;

  tmsBegin3 = clock();
  heavy_loop((void*)&first_elem);
  heavy_loop((void*)&bad_elem);
  tmsEnd3 = clock();

  tmsBegin1 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&bad_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd1 = clock(); 

  tmsBegin2 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&good_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd2 = clock();

  printf("%d %d %d\n", array[first_elem],array[bad_elem],array[good_elem]);
  time1 = (tmsEnd1-tmsBegin1)*1000/CLOCKS_PER_SEC;
  time2 = (tmsEnd2-tmsBegin2)*1000/CLOCKS_PER_SEC;
  time3 = (tmsEnd3-tmsBegin3)*1000/CLOCKS_PER_SEC;
  printf("%lld ms\n", time1);
  printf("%lld ms\n", time2);
  printf("%lld ms\n", time3);

  return 0; 
}

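I was very surprised when I saw the results (I ran it on an i5-430M processor).

With false sharing: 1020 ms.
Without false sharing: 710 ms, only 30% faster instead of 300% (some websites say it should be 300-400% faster).
Without pthreads: 580 ms.

Please point out my mistake or explain why this happens.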

score 23 · Accepted Answer

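False sharing is the result of multiple cores with separate caches accessing the same region of physical memory (although not the same address; that would be true sharing).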

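To understand false sharing, you need to understand caches. In most processors, each core has its own L1 cache, which holds recently accessed data. Caches are organized in "lines", which are aligned chunks of data, usually 32 or 64 bytes long (depending on your processor). When you read from an address that isn't in the cache, the whole line is read from main memory (or an L2 cache) into L1. When you write to an address in the cache, the line containing that address is marked "dirty".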

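This is where the sharing comes in. If multiple cores read from the same line, each of them can have a copy of the line in L1. However, if one copy is marked dirty, it invalidates the line in the other caches. If this didn't happen, writes made on one core might not become visible to the other cores until much later. So the next time another core reads from that line, the cache misses, and it has to fetch the line again.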

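False sharing happens when cores read and write different addresses on the same line. Even though they aren't sharing any data, the caches behave as if they were, because the addresses are so close together.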

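This effect is highly dependent on your processor's architecture. If you had a single-core processor, you wouldn't see the effect at all, because there would be no sharing. If your cache lines were longer, you would see the effect in both the "bad" and the "good" cases, because the addresses would still be close enough to share a line. If your cores didn't share an L2 cache (I'm guessing they do), you might see the 300-400% difference you mentioned, because on a cache miss they would have to go all the way to main memory.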

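You might also like to know that it matters that each thread both reads and writes (+= instead of =). Some processors have write-through caches, which means that if a core writes to an address that isn't in the cache, it doesn't miss and fetch the line from memory. Contrast this with write-back caches, which do miss on writes.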

score 4 · Accepted Answer

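A note on the clock() function in C: it gives you the processor time (in clock ticks) that the program used between the two calls, not the elapsed real time. So when you run two threads in parallel, the result is the CPU time of the first thread plus the CPU time of the second thread.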

I think what you want is a real (wall-clock) timer. For that, use

clock_gettime()

and you should get the expected output.

I ran your code with clock_gettime() and got this:

False sharing: 874.587381 ms
No false sharing: 331.844278 ms
Sequential computation: 604.160276 ms
