c++ - Pros and cons of an on-demand, conditional std::atomic_thread_fence acquire?

Question

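The code below shows two ways of acquiring shared state through an atomic flag. The reader thread calls poll1() or poll2() to check whether the writer has signaled the flag.

Poll Option #1: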

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

Poll Option #2:

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

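Note that Option #1 was presented in an earlier question, and Option #2 is similar to the example code on cppreference.com.

Assuming the reader agrees to examine the shared state only if the poll function returns true, are the two poll functions both correct and equivalent?

Does Option #2 have a standard name?

What are the benefits and drawbacks of each option?

Is Option #2 likely to be more efficient in practice? Is it possible for it to be less efficient?

Here is a full working example: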

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int x; // regular variable, could be a complex data structure

std::atomic<int> flag { 0 };

void writer_thread() {
    x = 42;
    // release value x to reader thread
    flag.store(1, std::memory_order_release);
}

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

int main() {
    x = 0;

    std::thread t(writer_thread);

    // "reader thread" ...  
    // sleep-wait is just for the test.
    // production code calls poll() at specific points

    while (!poll2()) // poll1() or poll2() here
      std::this_thread::sleep_for(std::chrono::milliseconds(50));

    std::cout << x << std::endl;

    t.join();
}

score 2 · Accepted Answer

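I think I can answer most of your questions.

Both options are certainly correct, but they are not exactly equivalent, because standalone fences apply slightly more broadly. They are equivalent for what you want to accomplish, but a standalone fence can technically order other operations as well (imagine this code being inlined). Jeff Preshing explains in this post, with an example, how a standalone fence differs from the ordering built into an acquire load or a release store.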
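To illustrate the "applies more broadly" point, here is a minimal sketch in which one acquire fence covers two relaxed loads at once. All names (payload_a, flag_b, poll_both, run_two_writer_demo) are hypothetical and do not come from the question. It relies on the standard's fence rule that a release store synchronizes with an acquire fence when an atomic load sequenced before that fence reads the stored value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// All names below are hypothetical and exist only for this illustration.
int payload_a = 0;
int payload_b = 0;
std::atomic<int> flag_a{0};
std::atomic<int> flag_b{0};

// One acquire fence covers every relaxed load sequenced before it,
// so a single fence pairs with both writers' release stores.
bool poll_both() {
    int a = flag_a.load(std::memory_order_relaxed);
    int b = flag_b.load(std::memory_order_relaxed);
    if (a == 1 && b == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

// Starts two writers, spins until both flags are set, then reads both payloads.
int run_two_writer_demo() {
    // Reset state before any thread is started.
    payload_a = 0;
    payload_b = 0;
    flag_a.store(0, std::memory_order_relaxed);
    flag_b.store(0, std::memory_order_relaxed);

    std::thread writer_a([] {
        payload_a = 40;
        flag_a.store(1, std::memory_order_release);
    });
    std::thread writer_b([] {
        payload_b = 2;
        flag_b.store(1, std::memory_order_release);
    });

    while (!poll_both())
        std::this_thread::yield();

    // Safe to read: poll_both() returned true, so the fence has run.
    int sum = payload_a + payload_b;

    writer_a.join();
    writer_b.join();
    assert(poll_both()); // the flags stay set once published
    return sum;
}
```

Calling run_two_writer_demo() from main() should yield 42. Written in the style of poll1, the same check would need an acquire load on each flag, whereas here the two loads share one fence.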

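As far as I know, the check-then-fence pattern in Option #2 has no name. It is not uncommon, though.

In terms of performance, with my g++ 4.8.1 on x64 (Linux), the assembly generated for both options boils down to a single load instruction. This is no surprise, since x86(-64) loads and stores all have acquire and release semantics at the hardware level (x86 is known for its fairly strong memory model).

For ARM, however, where memory barriers compile down to actual individual instructions, the following output is produced (using gcc.godbolt.com with -O3 -DNDEBUG):

For while (!poll1());: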

.L25:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    dmb     sy
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L25

For while (!poll2());:

.L29:
    ldr     r0, [r2]
    movw    r3, #:lower16:.LANCHOR0
    movt    r3, #:upper16:.LANCHOR0
    cmp     r0, #1
    bne     .L29
    dmb     sy

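You can see that the only difference is where the synchronization instruction (dmb) is placed: inside the loop for poll1, and after the loop for poll2. So poll2 really is more efficient in this real-world case :-) (But read on for why this may not matter if they are called in a loop to block until the flag changes.)

For ARM64, the output is different, because there are special load/store instructions with barriers built in (ldar -> load-acquire).

For while (!poll1());: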

.L16:
    ldar    w0, [x1]
    cmp     w0, 1
    bne     .L16

For while (!poll2());:

.L24:
    ldr     w0, [x1]
    cmp     w0, 1
    bne     .L24
    dmb     ishld

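Again, poll2 results in a loop with no barrier inside it and one barrier after it, whereas poll1 executes a barrier on every pass.

Now, finding out which one is actually faster requires running a benchmark, and unfortunately I don't have the setup for that. poll1 and poll2 may end up equally efficient here: even if individual (inlined) calls to poll1 take longer than those to poll2, the total time taken until the loop exits may be the same. Of course, this assumes a loop waiting for the flag to change; an individual call to poll1 does more work than an individual call to poll2.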
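As a starting point for such a benchmark, here is a rough sketch that times repeated polls while the flag is still unset, which is the case where poll1 pays for its ordering on every call and poll2 does not. The names bench_flag, bench_poll1, bench_poll2, PollTiming and time_polls are hypothetical. It runs on a single thread, so it only measures per-call cost and says nothing about how quickly a store from another thread becomes visible.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the flag used in the question.
std::atomic<int> bench_flag{0};

// Same shape as poll1: acquire ordering on every call.
bool bench_poll1() {
    return bench_flag.load(std::memory_order_acquire) == 1;
}

// Same shape as poll2: the fence runs only when the flag is set.
bool bench_poll2() {
    if (bench_flag.load(std::memory_order_relaxed) == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

struct PollTiming {
    std::uint64_t hits;             // polls that returned true
    std::chrono::nanoseconds total; // wall-clock time for all polls
};

// Calls poll() `iterations` times and reports the elapsed time.
// Counting the hits keeps the result of every call in use.
template <typename Poll>
PollTiming time_polls(Poll poll, std::uint64_t iterations) {
    std::uint64_t hits = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        if (poll())
            ++hits;
    }
    auto stop = std::chrono::steady_clock::now();
    assert(hits <= iterations);
    return {hits, std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

One way to use it is to compare time_polls(bench_poll1, n).total with time_polls(bench_poll2, n).total on the target hardware, for a large n and over several runs. Given the assembly above, any gap would be expected on ARM rather than on x86-64, but that is a guess until measured.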

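So, overall, I think it is fairly safe to say that poll2 should never be significantly less efficient than poll1 and can often be faster, as long as the compiler can eliminate the branch when the call is inlined (which seems to be the case for at least these three popular architectures).

My (slightly different) test code, for reference: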

#include <atomic>
#include <thread>
#include <cstdio>

int sharedState;
std::atomic<int> flag(0);

bool poll1() {
    return (flag.load(std::memory_order_acquire) == 1);
}

bool poll2() {
    int snapshot = flag.load(std::memory_order_relaxed);
    if (snapshot == 1) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

void __attribute__((noinline)) threadFunc()
{
    while (!poll2());
    std::printf("%d\n", sharedState);
}

int main(int argc, char** argv)
{
    std::thread t(threadFunc);
    sharedState = argc;
    flag.store(1, std::memory_order_release);
    t.join();
    return 0;
}
