
I know this isn't a new issue, but I got confused after reading about C++11 memory fences.

If I have one reader thread and one writer thread.
Can I use an ordinary int?

    int x = 0; // global

    // writer          // reader
    x = 1;             printf("%d\n", x);

Is this behavior undefined?
Might I read an indeterminate value in the reader thread?
Or is it like using an std::atomic_uint_fast32_t or std::atomic<int>, so the value will reach the reader thread eventually?

    std::atomic<int> x{0}; // global

    // writer                                 // reader
    x.store(1, std::memory_order_relaxed);    printf("%d\n", x.load(std::memory_order_relaxed));

Does the answer depend on the platform I'm using (x86, for example), where loading/storing an ordinary int is a single CPU instruction?

If both behave similarly, should I expect the same performance from both types?


2 Answers


In short: never use a plain int to share data in a multi-threaded environment.

The problem is not only your CPU, but your compiler's optimiser. GCC can (and will) optimise code like

    while (i == 1) {}

into

    if (i == 1) { while (1) {} }

Once it has checked the variable, it is allowed never to reload the value. That's separate from all the other possible issues, such as seeing half-written values (which usually won't occur for aligned ints on x86).

Measuring the effect of atomics is very hard -- in many cases CPUs can highly optimise the accesses; in others they are much slower. You really have to benchmark in practice.

answered 2016-06-02T21:50:10.907

Using atomics has effects at both the compiler and CPU level. As the comments suggest, you should always use atomics, because otherwise the compiler and the CPU will conspire to perform unintuitive transformations on your code that make it not do what you reasonably expect.

The second part of your question is more subtle -- what is the penalty of using an atomic instead of a naked int? This is of course terribly compiler- and CPU-dependent, but let's assume for a moment that your compiler is "smart" and you are on an Intel CPU. By smart, I mean it doesn't just wrap all of your accesses in a mutex block, which certainly meets all of the requirements of the atomics but would be sub-optimal in performance. Intel CPUs give you certain built-in guarantees about store/load visibility that make it easier for compilers to do the right thing without special instructions -- they just have to avoid optimising away the "normal" memory-ordering behaviour Intel documents for its processors. While this doesn't cover all cases, it does deal with your "relaxed consistency" case. For more info, see the documentation on memory-fencing instructions.

For your case with relaxed consistency, there is no CPU-level penalty on Intel, because no fence instructions need to be generated. Mostly the penalties come when you need stronger consistency (such as when implementing a spin-lock or when making lock-free algorithms in which publication order is important). These will either result in interlocked instructions, fence instructions or bus-lock prefixes, which can have significant penalties.

answered 2016-06-02T23:42:58.423