c++ - Why doesn't GCC use LOAD (without fence) and STORE+SFENCE for sequential consistency?

Question

Here are four approaches to implementing sequential consistency on x86/x86_64:

LOAD (without fence) and STORE+MFENCE
LOAD (without fence) and LOCK XCHG
MFENCE+LOAD and STORE (without fence)
LOCK XADD(0) and STORE (without fence)

As written here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

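C/C++11 Operation    x86 implementation

Load Seq_Cst: MOV (from memory)

Store Seq Cst: (LOCK) XCHG // alternative: MOV (into memory),MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq Cst store locks/fences the Seq Cst load:

Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE,MOV (from memory)

Store Seq Cst: MOV (into memory)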

GCC 4.8.2 (disassembly viewed in GDB on x86_64) uses the first (1) approach for C++11 std::memory_order_seq_cst, i.e. LOAD (without fence) and STORE+MFENCE:

std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8  <+0x0058>         mov    0x38(%rsp),%eax
0x4613ec  <+0x005c>         mov    %eax,0x20(%rsp)
0x4613f0  <+0x0060>         mfence

As we know, MFENCE = LFENCE+SFENCE. So we can rewrite this code as: LOAD (without fence) and STORE+LFENCE+SFENCE

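Questions:

Why don't we need LFENCE before the LOAD here, and why do we need LFENCE after the STORE (given that LFENCE only makes sense before a LOAD)?
Why doesn't GCC use this approach for std::memory_order_seq_cst: LOAD (without fence) and STORE+SFENCE?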

score 6 · Accepted Answer

The only reordering x86 performs (for normal memory accesses) is that a load that follows a store may be reordered ahead of that store.

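SFENCE guarantees that all stores before the fence complete before all stores after the fence. LFENCE guarantees that all loads before the fence complete before all loads after the fence. For normal memory accesses, the ordering guarantees of an individual SFENCE or LFENCE are already provided by default. Basically, LFENCE and SFENCE by themselves are only useful for the weaker memory access modes of x86.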

Neither LFENCE, SFENCE, nor LFENCE + SFENCE prevents a store followed by a load from being reordered. MFENCE does.

The relevant reference is the Intel x86 architecture manual.
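The store-then-load case can be made concrete with the classic store-buffer litmus test. The following is a minimal sketch, not code from the question; the helper name both_loads_saw_zero and its parameter are made up for illustration. With std::memory_order_seq_cst stores, the C++ memory model forbids the outcome where both threads read 0. If a weaker order such as std::memory_order_release is passed for the stores, that outcome is permitted, and on x86 it can show up because each plain store may still sit in the store buffer when the following load executes.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs one store-buffer trial and reports whether
// both threads read 0 from the other thread's variable.
bool both_loads_saw_zero(std::memory_order store_order) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, store_order);                 // store ...
        r1 = y.load(std::memory_order_seq_cst);  // ... followed by a load
    });
    std::thread t2([&] {
        y.store(1, store_order);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();

    // join() synchronizes with thread completion, so r1 and r2 are safe to read.
    return r1 == 0 && r2 == 0;
}
```

Calling both_loads_saw_zero(std::memory_order_seq_cst) in a loop should never return true. On some platforms the program has to be built with -pthread.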

score 6 · Accepted Answer

Consider the following code:

#include <atomic>
#include <cstring>

std::atomic<int> a;
char b[64];

void seq() {
  /*
    movl    $0, a(%rip)
    mfence
  */
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

void rel() {
  /*
    movl    $0, a(%rip)
   */
  int temp = 0;
  a.store(temp, std::memory_order_relaxed);
}

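With respect to the atomic variable "a", both seq() and rel() are ordered and atomic on the x86 architecture, because:

mov is an atomic instruction
mov is a legacy instruction, and Intel promises ordered memory semantics for legacy instructions, to stay compatible with old processors that always used ordered memory semantics.

No fence is needed to store a constant value into an atomic variable. The fence is there because std::memory_order_seq_cst implies that all memory is synchronized, not only the memory that holds the atomic variable.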

The effect can be demonstrated with the following set and get functions:

void set(const char *s) {
  strcpy(b, s);
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
}

const char *get() {
  int temp = 0;
  a.store(temp, std::memory_order_seq_cst);
  return b;
}

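strcpy is a library function that may use newer SSE instructions if they are available at run time. Since SSE instructions were not available on old processors, there is no backward-compatibility requirement, and the memory order is undefined. Thus the result of a strcpy in one thread may not be directly visible in other threads.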

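The set and get functions above use an atomic value to enforce memory synchronization, so that the result of strcpy becomes visible in other threads. Now the fences do matter, but their order inside the call to atomic::store is not significant, since the fences are not needed internally in atomic::store.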
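For comparison, here is a minimal sketch of the usual way to publish a buffer through an atomic flag. It is a variant, not the answerer's code: the names ready, buf, publish and consume are hypothetical, and the reader uses a load rather than a store. In the C++ memory model a release store paired with an acquire load is enough to make the earlier writes to the buffer visible to the reader; std::memory_order_seq_cst would work as well.

```cpp
#include <atomic>
#include <cstring>
#include <thread>

std::atomic<bool> ready{false};  // hypothetical flag guarding buf
char buf[64];

// Writer: fill the buffer first, then publish it with a release store.
void publish(const char *s) {
    std::strncpy(buf, s, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    ready.store(true, std::memory_order_release);
}

// Reader: wait for the flag with acquire loads, then read the buffer.
const char *consume() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return buf;
}
```

The writes to buf happen before the release store, and the acquire load that observes true happens before the read of buf, so the reader sees the complete string.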

score 5 · Accepted Answer

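std::atomic<int>::store is mapped to the compiler intrinsic __atomic_store_n. (This and the other atomic-operation intrinsics are documented under "Built-in functions for memory model aware atomic operations".) The _n suffix makes it type-generic; the back end actually implements variants for specific sizes in bytes. int on x86 is, AFAIK, always 32 bits long, which means we are looking for the definition of __atomic_store_4. The internals manual for this version of GCC says that the __atomic_store operations correspond to machine description patterns named atomic_store<mode>; the mode corresponding to a 4-byte integer is "SI" (see the machine modes documentation), so we are looking for something called "atomic_storesi" in the x86 machine description. That brings us to config/i386/sync.md, specifically this bit: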

(define_expand "atomic_store<mode>"
  [(set (match_operand:ATOMIC 0 "memory_operand")
        (unspec:ATOMIC [(match_operand:ATOMIC 1 "register_operand")
                        (match_operand:SI 2 "const_int_operand")]
                       UNSPEC_MOVA))]
  ""
{
  enum memmodel model = (enum memmodel) (INTVAL (operands[2]) & MEMMODEL_MASK);

  if (<MODE>mode == DImode && !TARGET_64BIT)
    {
      /* For DImode on 32-bit, we can use the FPU to perform the store.  */
      /* Note that while we could perform a cmpxchg8b loop, that turns
         out to be significantly larger than this plus a barrier.  */
      emit_insn (gen_atomic_storedi_fpu
                 (operands[0], operands[1],
                  assign_386_stack_local (DImode, SLOT_TEMP)));
    }
  else
    {
      /* For seq-cst stores, when we lack MFENCE, use XCHG.  */
      if (model == MEMMODEL_SEQ_CST && !(TARGET_64BIT || TARGET_SSE2))
        {
          emit_insn (gen_atomic_exchange<mode> (gen_reg_rtx (<MODE>mode),
                                                operands[0], operands[1],
                                                operands[2]));
          DONE;
        }

      /* Otherwise use a store.  */
      emit_insn (gen_atomic_store<mode>_1 (operands[0], operands[1],
                                           operands[2]));
    }
  /* ... followed by an MFENCE, if required.  */
  if (model == MEMMODEL_SEQ_CST)
    emit_insn (gen_mem_thread_fence (operands[2]));
  DONE;
})

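Without going into a great deal of detail, the bulk of this is a C function body that will be called to generate the low-level "RTL" intermediate representation of the atomic store operation. When it is invoked by your example code, <MODE>mode != DImode, model == MEMMODEL_SEQ_CST, and TARGET_SSE2 is true, so it will call gen_atomic_store<mode>_1 and then gen_mem_thread_fence. The latter function always generates mfence. (There is code in this file to generate sfence, but I believe it is only used for an explicitly coded _mm_sfence (from <xmmintrin.h>).)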
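The branch structure of that expander can be restated as ordinary C++. This is only a sketch of the control flow quoted above, not GCC code: MemModel, Target and lower_atomic_store are hypothetical names, and the returned strings are informal labels rather than real RTL or guaranteed instruction sequences.

```cpp
#include <string>

// Simplified, hypothetical stand-ins for GCC's memmodel and target flags.
enum class MemModel { Relaxed, Release, SeqCst };

struct Target {
    bool is_64bit;
    bool has_sse2;
};

// Follows the same branches as the quoted atomic_store<mode> expander.
std::string lower_atomic_store(MemModel model, bool is_dimode, const Target &t) {
    std::string seq;
    if (is_dimode && !t.is_64bit) {
        seq = "fpu-store";  // 64-bit store on a 32-bit target goes through the FPU
    } else {
        if (model == MemModel::SeqCst && !(t.is_64bit || t.has_sse2)) {
            return "xchg";  // no MFENCE available: exchange, then DONE (no fence)
        }
        seq = "mov";        // otherwise a plain store
    }
    if (model == MemModel::SeqCst) {
        seq += "; fence";   // gen_mem_thread_fence; mfence in the question's case
    }
    return seq;
}
```

For the question's case (int store, seq_cst, x86_64) this yields "mov; fence", which matches the mov plus mfence pair in the disassembly.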

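The comments suggest that someone thought MFENCE was required in this case. I conclude that either you are mistaken in thinking a load fence is not required, or this is a missed-optimization bug in GCC. It is not, for instance, an error in how you are using the compiler.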

score 5 · Accepted Answer

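SFENCE + LFENCE is not a StoreLoad barrier (MFENCE), so the premise of the question is incorrect. (See also my answer to another version of this same question from the same user: Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?)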

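SFENCE can pass (appear before) earlier loads. (It is just a StoreStore barrier.)
LFENCE can pass earlier stores. (Loads cannot cross it in either direction: LoadLoad barrier.)
Loads can pass SFENCE (but stores cannot pass LFENCE, so it is a LoadStore barrier as well as a LoadLoad barrier).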

LFENCE+SFENCE does not include anything that stops a store from being buffered until after a later load. MFENCE does prevent this.
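One way to see why combining the two weaker fences does not add up to MFENCE is to write the barrier kinds down as bit flags. This is a bookkeeping sketch of the classification given in this answer, with hypothetical names (Reorder, kSfence, kLfence, kMfence, blocks); it models the rules and does not execute any fence instruction.

```cpp
#include <cstdint>

// The four kinds of reordering a barrier can forbid.
enum Reorder : std::uint8_t {
    LoadLoad   = 1 << 0,
    LoadStore  = 1 << 1,
    StoreStore = 1 << 2,
    StoreLoad  = 1 << 3,
};

// Classification as described in the answer above.
constexpr std::uint8_t kSfence = StoreStore;
constexpr std::uint8_t kLfence = LoadLoad | LoadStore;
constexpr std::uint8_t kMfence = LoadLoad | LoadStore | StoreStore | StoreLoad;

constexpr bool blocks(std::uint8_t fence, Reorder r) {
    return (fence & r) != 0;
}

// The union of SFENCE and LFENCE still has no StoreLoad bit; MFENCE does.
static_assert(!blocks(kSfence | kLfence, StoreLoad), "SFENCE+LFENCE is not StoreLoad");
static_assert(blocks(kMfence, StoreLoad), "MFENCE is a StoreLoad barrier");
```

The static_asserts hold at compile time: however SFENCE and LFENCE are combined, the StoreLoad bit comes only from MFENCE (or a locked instruction such as XCHG, per the mapping in the question).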

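Preshing's blog post explains in more detail, with diagrams, what makes StoreLoad barriers special, and includes a practical working code example that demonstrates reordering without MFENCE. Anyone who is confused about memory ordering should start with that blog.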

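x86 has a strong memory model in which every normal store has release semantics and every normal load has acquire semantics. This post has the details.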

LFENCE and SFENCE exist only for use with movnt loads/stores, which are weakly ordered as well as cache-bypassing.

In case those links ever die, there is even more information in my answer to another similar question.
