c++ - Byte alignment and false sharing cause a performance difference on x86-64

Question

Environment: x86-64; Linux (CentOS); 8 CPU cores.
To measure the cost of false sharing, I wrote the following C++ code:

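#include <chrono>
#include <cstdint>
#include <iostream>
using namespace std;

// a and b are adjacent 4-byte values, so they are expected to share a cache line.
volatile int32_t a;
volatile int32_t b;
// p1 and p2 are 56-byte padding arrays meant to keep c and d apart.
int64_t p1[7];
volatile int64_t c;
int64_t p2[7];
volatile int64_t d;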

void thread1(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        a = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 1 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread2(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        b = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 2 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread3(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        c = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 3 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread4(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        d = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 4 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
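The driver that starts the threads and prints the addresses is not shown above. A minimal sketch of what it might look like follows, assuming std::thread is used. The names timed_writes, addr_of and run_all are hypothetical and not from my original file; the loop body is the same as in thread1..thread4, with the target and the iteration count passed in so it can be tried with a small count first.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>

// Hypothetical helper: same store loop as thread1..thread4.
// Returns the elapsed time in nanoseconds.
template <typename T>
long long timed_writes(volatile T* target, std::size_t iterations) {
    auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        *target = static_cast<T>(i % 512);
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

// Hypothetical helper: before C++23, operator<< has no overload for pointers
// to volatile, so streaming one directly prints it as a bool. Dropping the
// volatile qualifier first makes the address itself print.
template <typename T>
const void* addr_of(volatile T* p) {
    return const_cast<const T*>(p);
}

// Hypothetical driver: prints the four addresses, runs one writer per
// thread, and returns the four costs in the order a, b, c, d.
std::array<long long, 4> run_all(volatile std::int32_t* a, volatile std::int32_t* b,
                                 volatile std::int64_t* c, volatile std::int64_t* d,
                                 std::size_t iterations) {
    std::cout << "a addr " << addr_of(a) << '\n'
              << "b addr " << addr_of(b) << '\n'
              << "c addr " << addr_of(c) << '\n'
              << "d addr " << addr_of(d) << '\n';

    std::array<long long, 4> cost{};
    std::thread t1([&cost, a, iterations] { cost[0] = timed_writes(a, iterations); });
    std::thread t2([&cost, b, iterations] { cost[1] = timed_writes(b, iterations); });
    std::thread t3([&cost, c, iterations] { cost[2] = timed_writes(c, iterations); });
    std::thread t4([&cost, d, iterations] { cost[3] = timed_writes(d, iterations); });
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    return cost;
}
```

In this sketch the costs are collected and returned after all threads have joined, instead of being printed from inside each thread, so the output lines cannot interleave. Calling run_all(&a, &b, &c, &d, 1000000000) would correspond to the test described here.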

This is my compile command: g++ xxx.cpp --std=c++11 -O0 -lpthread -g, so optimization is disabled (-O0).

I printed the virtual addresses of a, b, c and d:

a addr 0x406200
b addr 0x406204
c addr 0x406258
d addr 0x406298
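To make the layout explicit, here is a small sketch that maps each printed address to the cache line it falls in. It assumes 64-byte cache lines, which is typical for x86-64 but which I have not verified on the test machine. The helper names line_base and straddles_line are hypothetical.

```cpp
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86-64, not verified here).
constexpr std::uint64_t kCacheLine = 64;

// Base address of the cache line that contains addr.
constexpr std::uint64_t line_base(std::uint64_t addr) {
    return addr & ~(kCacheLine - 1);
}

// True if an object of `size` bytes starting at addr crosses a line boundary.
constexpr bool straddles_line(std::uint64_t addr, std::uint64_t size) {
    return line_base(addr) != line_base(addr + size - 1);
}

// Addresses printed by the int32_t run.
constexpr std::uint64_t kAddrA = 0x406200;
constexpr std::uint64_t kAddrB = 0x406204;
constexpr std::uint64_t kAddrC = 0x406258;
constexpr std::uint64_t kAddrD = 0x406298;

static_assert(line_base(kAddrA) == 0x406200, "a is in line 0x406200");
static_assert(line_base(kAddrB) == 0x406200, "b shares a line with a");
static_assert(line_base(kAddrC) == 0x406240, "c is in line 0x406240");
static_assert(line_base(kAddrD) == 0x406280, "d is in line 0x406280");
static_assert(!straddles_line(kAddrC, 8), "c does not cross a line boundary");
static_assert(!straddles_line(kAddrD, 8), "d does not cross a line boundary");
```

Under that assumption, a and b share line 0x406200, c sits alone in line 0x406240, d sits alone in line 0x406280, and neither c nor d crosses a line boundary.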

These are the results (elapsed time in nanoseconds):

 4 cost:2186474910
 3 cost:6114449628
 1 cost:7464439728
 2 cost:7469428696

As far as I understand, thread3 has no cache-line bouncing or false sharing with any other thread, so why is it so much slower than thread4?

Also: if I change int32_t a, b to int64_t a, b, the addresses and results become:

a addr 0x4061e0
b addr 0x4061e8
c addr 0x406238
d addr 0x406278
3 cost:2188341526
4 cost:2193782423
2 cost:6479324727
1 cost:6645607256

This second result is what I expected.
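The placement above depends on how the toolchain happens to lay out the globals and the padding arrays. For comparison, a sketch of a layout that pins each variable to the start of its own line could look like the following. It again assumes 64-byte cache lines; PaddedCounter, c_padded, d_padded and same_line are hypothetical names, and I have not rerun the benchmark with this layout.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86-64, not verified here).
constexpr std::size_t kCacheLine = 64;

// Hypothetical wrapper: every instance starts on a cache-line boundary, and
// the alignment requirement pads its size up to one full line.
struct alignas(kCacheLine) PaddedCounter {
    volatile std::int64_t value;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Hypothetical stand-ins for c and d.
PaddedCounter c_padded;
PaddedCounter d_padded;

// True if the two addresses fall in the same 64-byte line.
inline bool same_line(const volatile void* x, const volatile void* y) {
    return reinterpret_cast<std::uintptr_t>(x) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(y) / kCacheLine;
}
```

This only guarantees that the two counters occupy different 64-byte lines; it says nothing about which lines end up next to each other in memory.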
