c - I don't understand why changing how a variable is accessed/stored in a pthread subroutine improves performance so drastically

Question

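I'm new to multithreaded programming, and I knew there could be strange side effects if you aren't careful, but I didn't expect to be this puzzled by code I wrote myself. I'm writing what I thought was the obvious first test for threads: summing the numbers from 0 to x (there is of course the closed form n(n+1)/2, see https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/, but what I want is practice using threads, not the fastest possible program). A function creates the threads based on a hard-coded number of cores on the system and a "boolean" that says whether the processor supports multithreading. I split the work more or less evenly across the threads, so each thread sums over its own range. In theory, if all the threads manage to work together, I can do numcores*normal_computation, which is exciting, and to my surprise it worked more or less as I expected, until I made a few adjustments.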

Before continuing, I think some code will help:

These are the preprocessor defines I use in the base code:

#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL

I use this struct to pass arguments to the thread function:

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

This is the function that creates the threads:

int64_t SumUpTo_WithThreads(int64_t limit)
{   //start counting from zero
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //loss of precision after decimal be careful
    int64_t total = 0;

    //i < numthread-1 since offset is not assured to be exactly limit/numthreads due to integer division
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
        pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
    }
    //edge case catch
    //limit + 1, since SumBetween() is not inclusive of .end aka stops at .end -1 for each loop
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
    pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));

    //finishing
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i], NULL); //used to ensure thread is done before adding .return_total
        total += listofargs[i].return_total;
    }

    return total;
}
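To make the work split concrete: with the defines above, numthreads evaluates to 4 + (int)(4*1*0.25) = 5, and offset is 1000000000/5 = 200000000. The following is a minimal sketch of the same partitioning logic pulled out into a helper. The names `range` and `RangeForThread` are hypothetical and do not appear in the original code; they only mirror what SumUpTo_WithThreads() computes inline.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range [start, end) handed to one thread. */
typedef struct range
{
    int64_t start;
    int64_t end;
} range;

/* Hypothetical helper (not in the original post): returns the range that
   thread i of numthreads would receive in SumUpTo_WithThreads(). */
range RangeForThread(int64_t limit, int numthreads, int i)
{
    assert(numthreads > 0 && i >= 0 && i < numthreads);

    int64_t offset = limit / numthreads; /* integer division truncates */
    range r;
    r.start = offset * i;
    /* The last thread absorbs the remainder and covers limit itself,
       hence limit + 1 as its exclusive end. */
    r.end = (i == numthreads - 1) ? limit + 1 : offset * (i + 1);
    return r;
}
```

For limit = 1000000000 and 5 threads, thread 0 gets [0, 200000000) and thread 4 gets [800000000, 1000000001).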

This is the "normal" implementation of the sum, just for comparison:

int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}
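As a sanity check for either implementation, the total can be compared against the closed form limit*(limit+1)/2. This is a small sketch only; the function name `SumUpTo_ClosedForm` is hypothetical and not part of the original code, and it assumes the result fits in an int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: closed-form value of 0 + 1 + ... + limit.
   Assumes limit >= 0 and that the result fits in int64_t. */
int64_t SumUpTo_ClosedForm(int64_t limit)
{
    assert(limit >= 0);

    /* Divide whichever factor is even first, so the intermediate
       product stays as small as possible. */
    if (limit % 2 == 0)
        return (limit / 2) * (limit + 1);
    return limit * ((limit + 1) / 2);
}
```

For BIGVALUE (1000000000) this gives 500000000500000000, which is the value both the threaded and the normal version should print.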

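This is the function the threads run. It has "two implementations", one that is fast for some reason and one that is slow for some reason (this is what confuses me). Side note: I use the preprocessor directives only to make the SLOWER and FASTER versions easier to compile.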

void* SumBetween(void *arg)
{
    #ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
    #endif

    #ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
    #endif
    
    return NULL;
}

Here is my main:

int main(void)
{
    #ifdef THREADS
    printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
    #endif

    #ifdef NORMAL
    printf("%ld\n", SumUpTo(BIGVALUE));
    #endif 
    return 0;
}

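Here is how I compile (I made sure to set the optimization level to 0 so the compiler doesn't optimize the silly summing program away entirely; after all, I want to learn how to use threads!):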

make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe

make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe

clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Here are the results/differences (note that code generated with GCC shows the same side effect):

slower:
sudo time ./slower.exe 
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps

faster:
sudo time ./faster.exe 
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps

Why is using an extra stack-defined variable so much faster than dereferencing the passed-in struct pointer?

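I tried to find the answer to this myself. I ended up running some tests that implement the same basic/naive summing algorithm as my SumUpTo() function, where the only difference is the level of indirection of the data being worked on.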
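The bodies of those test functions are not shown in this post, only their names and signatures in the output. For illustration, two of the variants might look roughly like the sketch below. The bodies are a guess, and `SKETCH_LIMIT` is a hypothetical, much smaller stand-in for BIGVALUE so the example finishes quickly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BIGVALUE; the real tests presumably
   looped up to BIGVALUE. */
#define SKETCH_LIMIT INT64_C(100000)

/* Accumulator is a local, stack-defined variable. */
int64_t sum(void)
{
    int64_t total = 0;
    for (int64_t i = 0; i <= SKETCH_LIMIT; i++)
        total += i;
    return total;
}

/* Accumulator lives behind a pointer, so every iteration reads and
   writes through *total (at -O0 nothing is kept in a register). */
void sumpoint(int64_t *total)
{
    assert(total != NULL);
    *total = 0;
    for (int64_t i = 0; i <= SKETCH_LIMIT; i++)
        *total += i;
}
```

Both variants compute the same value; they differ only in where the running total is stored.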

Here are the results:

Choose a function to execute!

int64_t sum(void) took: 2.207833 (s) //new stack defined variable, basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)

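The tests produced values more or less in line with what I expected. So I deduced that there must be something going on beyond plain indirection.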
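One thing the single-threaded tests cannot show is how the per-thread sum_args elements sit next to each other in memory, since in the threaded version several threads write to neighbouring elements of listofargs at the same time. The sketch below computes which cache line each return_total field would land on. It assumes a 64-byte cache line, which this post does not state, and `CacheLineOfReturnTotal` is a hypothetical helper name. Whether this layout accounts for the measured difference is not established here; it is only something that could be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

/* Assumption: 64-byte cache lines (not stated in the post). */
#define ASSUMED_CACHE_LINE 64u

/* Hypothetical helper: index of the cache line that holds
   array[i].return_total, for an array starting at address base. */
size_t CacheLineOfReturnTotal(uintptr_t base, size_t i)
{
    uintptr_t byte_offset = (uintptr_t)(i * sizeof(sum_args)
                                        + offsetof(sum_args, return_total));
    assert(byte_offset <= UINTPTR_MAX - base); /* guard against wrap-around */
    return (size_t)((base + byte_offset) / ASSUMED_CACHE_LINE);
}
```

In the real program the base would be (uintptr_t)&listofargs[0]. With a 24-byte sum_args, five elements occupy 120 bytes, so their return_total fields can share at most a few cache lines between them.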

Just to add more information: I'm running Linux, specifically the Mint distribution.

My processor info is as follows:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           42
Model name:                      Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         813.451
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4784.41
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cach
                                 e flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB condit
                                 ional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ht tm pbe syscall nx rdtscp lm con
                                 stant_tsc arch_perfmon pebs bts nopl xtopology 
                                 nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes
                                 64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xt
                                 pr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_de
                                 adline_timer aes xsave avx lahf_lm epb pti ssbd
                                  ibrs ibpb stibp tpr_shadow vnmi flexpriority e
                                 pt vpid xsaveopt dtherm ida arat pln pts md_cle
                                 ar flush_l1d

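If you want to compile the code yourself, or look at the assembly generated for my particular case, see https://github.com/spaceface102/Weird_Threads. The main source file is "countV2.c", in case you get lost. Thanks for your help!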

/*EOPost*/
