c++ - Multithreaded program hangs in optimized mode but runs normally at -O0

Question

I wrote a simple multithreaded program as follows:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

static bool finished = false;

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

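It behaves normally in debug mode in Visual Studio, or with -O0 in gcc, and prints the result after about a second. But in Release mode, or with -O1, -O2 or -O3, it hangs and prints nothing.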

score 103 · Accepted Answer

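Two threads accessing a non-atomic, non-guarded variable is undefined behaviour (UB). Here that variable is finished. You can make finished of type std::atomic<bool> to fix this.

My fix: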

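#include <iostream>
#include <future>
#include <atomic>
#include <chrono>
#include <thread>

// Brace initialisation also compiles before C++17, unlike `= false`.
static std::atomic<bool> finished{false};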

int func()
{
    size_t i = 0;
    while (!finished)
        ++i;
    return i;
}

int main()
{
    auto result=std::async(std::launch::async, func);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    finished=true;
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Output:

result =1023045342
main thread id=140147660588864

Live demo on coliru

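Somebody may think: "It's a bool, probably one bit. How can this be non-atomic?" (I did when I started with multithreading myself.)

But note that lack of tearing is not the only thing std::atomic gives you. It also makes concurrent read and write access from multiple threads well-defined, which stops the compiler from assuming that re-reading the variable will always see the same value.

Leaving a bool unguarded and non-atomic can cause additional issues:

The compiler might decide to keep the variable in a register, or even CSE (common-subexpression-eliminate) multiple accesses into one and hoist the load out of a loop.
The variable might be cached for a CPU core. (In real life, CPUs have coherent caches, so this is not a real problem. But the C++ standard is loose enough to cover a hypothetical C++ implementation on non-coherent shared memory, where atomic<bool> with memory_order_relaxed store/load would work but volatile would not. Using volatile for this would be UB, even though it works in practice on real C++ implementations.)

To prevent this from happening, the compiler must be told explicitly not to do it.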
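As a rough sketch of what "telling the compiler explicitly" can look like, the following spells out the memory_order_relaxed load and store mentioned above. The names spin_until and run_for are made up for this illustration and do not appear in the question. Relaxed ordering is assumed to be enough here only because the flag is the sole thing shared through the atomic (the counter comes back through the future); if the flag were used to publish other data, release/acquire would be the safer choice.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>

// Hypothetical helper: spins until `stop` is observed as true and returns
// the number of iterations completed. The atomic load cannot be hoisted
// out of the loop the way a plain bool read can.
std::size_t spin_until(const std::atomic<bool>& stop) {
    std::size_t i = 0;
    while (!stop.load(std::memory_order_relaxed))
        ++i;
    return i;
}

// Hypothetical driver: starts the spinner on another thread, waits for
// `delay`, then raises the flag and collects the iteration count.
std::size_t run_for(std::chrono::milliseconds delay) {
    std::atomic<bool> stop{false};
    auto result = std::async(std::launch::async, spin_until, std::cref(stop));
    std::this_thread::sleep_for(delay);
    stop.store(true, std::memory_order_relaxed);
    return result.get();  // `stop` is still alive while the worker runs
}
```

Calling run_for(std::chrono::milliseconds(10)) should return at any optimisation level, which is the property the plain bool version lacks.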

I'm a little surprised by the evolving discussion about the potential relation of volatile to this issue. So I'd like to add my two cents:

score 44 · Accepted Answer

Scheff's answer describes how to fix your code. I thought I would add a little information on what is actually happening in this case.

I compiled your code at godbolt using optimisation level 1 (-O1). Your function compiles like so:

func():
  cmp BYTE PTR finished[rip], 0
  jne .L4
.L5:
  jmp .L5
.L4:
  mov eax, 0
  ret

So, what is happening here? First, we have a comparison: cmp BYTE PTR finished[rip], 0 - this checks to see if finished is false or not.

If it is not false (i.e. true), we should exit the loop on the first run. This is accomplished by jne .L4, which jumps (when not equal) to label .L4, where the value of i (0) is stored in a register for later use and the function returns.

If it is false, however, we move to

.L5:
  jmp .L5

This is an unconditional jump to label .L5, which just so happens to be the jump instruction itself.

In other words, the thread is put into an infinite busy loop.

So why has this happened?

As far as the optimiser is concerned, threads are outside of its purview. It assumes other threads aren't reading or writing variables simultaneously (because that would be data-race UB). You need to tell it that it cannot optimise accesses away. This is where Scheff's answer comes in. I won't bother to repeat him.

Because the optimiser is not told that the finished variable may potentially change during execution of the function, it sees that finished is not modified by the function itself and assumes that it is constant.

The optimised code provides the two code paths that will result from entering the function with a constant bool value; either it runs the loop infinitely, or the loop is never run.
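To make the two code paths easier to see, here is a rough C++ rendering of the transformation, written as a sketch rather than as actual compiler output. The function names and the spin_limit parameter are assumptions added for illustration; the limit exists only so that the example terminates, whereas the real -O1 output spins forever.

```cpp
#include <cstddef>

// The loop as written in the source: the flag is re-read before every
// iteration. `spin_limit` is an assumption added so the sketch terminates.
std::size_t func_as_written(const bool& finished, std::size_t spin_limit) {
    std::size_t i = 0;
    while (!finished && i < spin_limit)
        ++i;
    return i;
}

// Roughly what the -O1 listing does: read the flag once on entry, then
// either return 0 immediately or spin without ever looking at it again.
std::size_t func_as_compiled(const bool& finished, std::size_t spin_limit) {
    if (finished)           // cmp BYTE PTR finished[rip], 0 / jne .L4
        return 0;           // .L4: mov eax, 0 / ret
    std::size_t i = 0;
    while (i < spin_limit)  // .L5: jmp .L5 (unbounded in the real output)
        ++i;
    return i;
}
```

In a single-threaded run the two functions return the same value for any input, which is why the optimiser treats the rewrite as legal. They only diverge when another thread changes the flag mid-loop, and for a plain bool that is precisely the case the compiler is allowed to ignore.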

At -O0 the compiler (as expected) does not optimise the loop body and comparison away:

func():
  push rbp
  mov rbp, rsp
  mov QWORD PTR [rbp-8], 0
.L148:
  movzx eax, BYTE PTR finished[rip]
  test al, al
  jne .L147
  add QWORD PTR [rbp-8], 1
  jmp .L148
.L147:
  mov rax, QWORD PTR [rbp-8]
  pop rbp
  ret

Therefore the function does work when unoptimised. The lack of atomicity is typically not a problem in practice here, because the code and data type are simple, although it is formally still a data race. Probably the worst we could run into here is a value of i that is off by one from what it should be.

A more complex system with data-structures is far more likely to result in corrupted data, or improper execution.
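As a hedged illustration of that last point, consider a flag that guards a larger structure instead of ending a loop. In the sketch below (the function name and the vector payload are invented for this example), the producer fills a std::vector and then sets an atomic flag with release ordering, and the consumer waits with acquire ordering before reading. With a plain bool in place of the atomic, the consumer could formally read the vector while it is still being modified.

```cpp
#include <atomic>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: one thread builds a payload, another reads it.
// Assumes n >= 0. Returns the sum 1 + 2 + ... + n as seen by the consumer.
int produce_and_consume(int n) {
    std::vector<int> payload;
    std::atomic<bool> ready{false};

    std::thread producer([&] {
        payload.resize(n);
        std::iota(payload.begin(), payload.end(), 1);  // 1, 2, ..., n
        ready.store(true, std::memory_order_release);  // publish the payload
    });

    // The acquire load pairs with the release store above, so every write
    // to `payload` is visible once `ready` is observed as true.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();

    int sum = std::accumulate(payload.begin(), payload.end(), 0);
    producer.join();
    return sum;
}
```

A std::mutex with a std::condition_variable would be an equally reasonable choice and avoids the busy wait; the atomic flag is used here only to stay close to the original question.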

score 5 · Accepted Answer

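For completeness on the learning curve: you should avoid using global variables. You did a good job by making it static, though, so it is local to the translation unit.

Here is an example: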

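#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <thread>

class ST {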
public:
    int func()
    {
        size_t i = 0;
        while (!finished)
            ++i;
        return i;
    }
    void setFinished(bool val)
    {
        finished = val;
    }
private:
    std::atomic<bool> finished{false};
};

int main()
{
    ST st;
    auto result=std::async(std::launch::async, &ST::func, std::ref(st));
    std::this_thread::sleep_for(std::chrono::seconds(1));
    st.setFinished(true);
    std::cout<<"result ="<<result.get();
    std::cout<<"\nmain thread id="<<std::this_thread::get_id()<<std::endl;
}

Live on Wandbox
