c++ - start_thread 克隆大部分时间都在并行程序中 - 并行化错误或错误报告？

Question

我目前正在对 C++ 程序进行并行化，以提高其在多核系统上的性能。使用 OpenMP 并考虑挑战（线程同步、数据访问等），我们终于找到了使整个程序并行的方法，但性能提升并没有压倒性的优势。

使用英特尔 VTune Amplifier，我进行了热点搜索，发现几乎在每个应该并行执行的函数调用中，来自 libgomp.so 的“start_thread clone”所花费的时间都比函数的实际执行时间要长：

英特尔 VTune 结果图像

这确实出乎意料，因为我检查过，在当前的 OpenMP 实现中，从并行和串行区域切换几乎没有任何损失。根据这个讨论：

线程在您的程序启动时启动（或第一次需要，取决于实现）。在其他任何地方暂停你的程序，你会注意到线程仍然存在

我这样做了，在调试器中停止了程序，在第一个并行区域之前只有一个线程，之后，无论我在哪里停止（并行或串行区域），都有多个线程。所以我确信每次“重生”新线程都不应该有任何开销。

现在 VTune 以不同的方式告诉我，就我能理解的测量结果而言。有人可以在这里帮助我吗？

score 1 · Accepted Answer

Please download corresponding debug package for libgomp containing debug symbols for this library - VTune will be able to classify CPU time better. Presence of libgomp in a hotspot may be caused by Spin Locks as result of bad work splitting.

Also please note that time on screenshot does not mean "start_thread clone" took 11 seconds - it means hotspot (unresolved function) took 11 seconds being called from start_thread through calcGrowthStep region.

c++ - start_thread 克隆大部分时间都在并行程序中 - 并行化错误或错误报告？

1 回答 1

Related

Reference