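I'm new to multithreaded programming. I knew it could produce some strange side effects if you aren't careful, but I didn't expect to be this puzzled by code I wrote myself. I'm writing what I think is the obvious first test for threads: summing the numbers from 0 to x (of course there is the closed form n(n+1)/2, see https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/, but my goal is to practice using threads, not to make this program as fast as possible). A function call creates the threads, based on a hard-coded number of cores on the system and a "boolean" that says whether the processor supports multithreading. I split the work more or less evenly across the threads, so each thread sums over its own range. In theory, if all the threads manage to run together, I get numcores*normal_computation throughput, which is really exciting. To my surprise, it worked more or less as I expected, until I made a few tweaks.
Before continuing, I think some code will help.
These are the preprocessor definitions I use in the base code: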
#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL
I use this struct to pass arguments to my thread function:
typedef struct sum_args
{
int64_t start;
int64_t end;
int64_t return_total;
} sum_args;
This is the function that creates the threads:
int64_t SumUpTo_WithThreads(int64_t limit)
{ //start counting from zero
const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
pthread_t threads[numthreads];
sum_args listofargs[numthreads];
int64_t offset = limit/numthreads; //loss of precision after decimal be careful
int64_t total = 0;
//i < numthread-1 since offset is not assured to be exactly limit/numthreads due to integer division
for (int i = 0; i < numthreads-1; i++)
{
listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
}
//edge case catch
//limit + 1, since SumBetween() is not inclusive of .end aka stops at .end -1 for each loop
listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));
//finishing
for (int i = 0; i < numthreads; i++)
{
pthread_join(threads[i], NULL); //used to ensure thread is done before adding .return_total
total += listofargs[i].return_total;
}
return total;
}
This is just the "normal" implementation of the sum, for comparison:
int64_t SumUpTo(int64_t limit)
{
uint64_t total = 0;
for (uint64_t i = 0; i <= limit; i++)
total += i;
return total;
}
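This is the function the threads run. It has "two implementations": one that is fast for some reason, and one that is slow for some reason (this is what puzzles me). Side note: I use preprocessor directives only to make the SLOWER and FASTER versions easier to compile.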
void* SumBetween(void *arg)
{
#ifdef SLOWER
((sum_args *)arg)->return_total = 0;
for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
((sum_args *)arg)->return_total += i;
#endif
#ifdef FASTER
uint64_t total = 0;
for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
total += i;
((sum_args *)arg)->return_total = total;
#endif
return NULL;
}
Here is my main:
int main(void)
{
#ifdef THREADS
printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
#endif
#ifdef NORMAL
printf("%ld\n", SumUpTo(BIGVALUE));
#endif
return 0;
}
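Here is how I compile (I made sure to set the optimization level to 0 so the compiler doesn't optimize the silly summing routine away entirely; after all, I want to learn how to use threads!):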
make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe
make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe
clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Here are the results/differences (note: code generated by GCC shows the same side effect):
slower:
sudo time ./slower.exe
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
faster:
sudo time ./faster.exe
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps
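Why is using an extra stack-defined variable so much faster than dereferencing the passed-in struct pointer?
I tried to find the answer myself. I ended up running some single-threaded tests that implement the same basic/naive summing algorithm as my SumUpTo() function, where the only difference is the level of indirection of the data being operated on.
Here are the results: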
Choose a function to execute!
int64_t sum(void) took: 2.207833 (s) //new stack defined variable, basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)
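The tests produced more or less the values I expected: the extra indirection costs a little, but nowhere near the roughly 5x difference seen with threads. So I concluded that something on top of this idea must be going on.
To add some more information: I'm running Linux, specifically the Mint distribution.
My processor information is as follows: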
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping: 7
CPU MHz: 813.451
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 4784.41
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cach
e flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
ia prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user
pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB condit
ional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtr
r pge mca cmov pat pse36 clflush dts acpi mmx f
xsr sse sse2 ht tm pbe syscall nx rdtscp lm con
stant_tsc arch_perfmon pebs bts nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes
64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xt
pr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_de
adline_timer aes xsave avx lahf_lm epb pti ssbd
ibrs ibpb stibp tpr_shadow vnmi flexpriority e
pt vpid xsaveopt dtherm ida arat pln pts md_cle
ar flush_l1d
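If you want to compile the code yourself, or look at the assembly generated for my particular case, see https://github.com/spaceface102/Weird_Threads. The main source file is "countV2.c", in case you get lost. Thanks for your help!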
/*EOPost*/