linux - Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

Question

I'm writing a ray tracer.

Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.

In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.

I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.

Host system: Ubuntu Linux 10.04 AMD64.

I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained, i.e. a faster algorithm has always run faster in both debug and optimized builds.

Any idea why I'm seeing this behavior?

Debug version is compiled with "-g3 -pg". Optimized version with "-O3".

Optimized no threading:        0m4.864s
Optimized threading:           0m2.075s

Debug no threading:            0m30.351s
Debug threading:               0m39.860s
Debug threading after "strip": 0m39.767s

Debug no threading (no-pg):    0m10.428s
Debug threading (no-pg):       0m4.045s

This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.

Since "-pg" is broken on threaded applications anyway, I'll just remove it.
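
To make the delta concrete, the timings listed in the question can be turned into speedup ratios (single-threaded time divided by threaded time). The following is only a minimal sketch; the function names `speedup` and `print_speedups` are made up for illustration, and the numbers are copied from the measurements above.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of the threaded run relative to the single-threaded run.
 * Values above 1.0 mean threading helped; values below 1.0 mean it hurt.
 * (Hypothetical helper, not part of the original program.) */
double speedup(double single_s, double threaded_s)
{
    assert(single_s > 0.0 && threaded_s > 0.0);
    return single_s / threaded_s;
}

/* Print the ratio for each of the three build configurations timed above. */
void print_speedups(void)
{
    printf("-O3:     %.2fx\n", speedup(4.864, 2.075));
    printf("-g3:     %.2fx\n", speedup(10.428, 4.045));
    printf("-g3 -pg: %.2fx\n", speedup(30.351, 39.860));
}
```

Calling `print_speedups()` from `main` gives roughly 2.34x for "-O3", 2.58x for "-g3" without "-pg", and 0.76x for "-g3 -pg", so only the "-pg" build gets slower when threads are added.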

score 8 · Accepted Answer

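What do you get without the -pg flag? That flag is not for debugging symbols (which don't affect code generation); it is for profiling (which does).

It is quite likely that profiling in a multithreaded process requires additional locking, which slows down the multithreaded version, possibly to the point of making it slower than the single-threaded version.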
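
One way to check this on your own machine is a small benchmark in which several threads call a tiny function many times, built once with and once without -pg. This is a rough sketch under assumptions: `tiny_step`, `worker`, `run_workload` and `MAX_THREADS` are hypothetical names, and the workload is a stand-in for the ray tracer's hot loop, not the asker's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 64   /* arbitrary upper bound for this sketch */

/* A deliberately tiny function: in a -pg build every call also pays for
 * the profiling hook that gcc inserts at function entry. */
__attribute__((noinline)) static long tiny_step(long x)
{
    return x + 1;
}

/* Each worker reads its call count from its slot and writes its result back. */
static void *worker(void *arg)
{
    long calls = *(long *)arg;
    long acc = 0;
    for (long i = 0; i < calls; i++)
        acc = tiny_step(acc);
    *(long *)arg = acc;
    return NULL;
}

/* Run `nthreads` workers, each making `calls` calls to tiny_step().
 * Returns elapsed wall-clock seconds; *total receives the summed results. */
double run_workload(int nthreads, long calls, long *total)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    long slot[MAX_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++) {
        slot[i] = calls;
        int rc = pthread_create(&tid[i], NULL, worker, &slot[i]);
        assert(rc == 0);
        (void)rc;
    }
    *total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        *total += slot[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling `run_workload(1, n, &total)` and `run_workload(4, n, &total)` from `main`, in a build with `-pthread` and again with `-pg -pthread`, lets you compare how the wall-clock time scales in each case. The sketch does not show what the hook does internally; it only makes the effect measurable.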

score 2 · Accepted Answer

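You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, the consequence is that you lose symbols that are useful for debugging.

Your application is not running slower because of debug symbols; it is running slower because the compiler performs fewer optimizations.

Apart from taking up more disk space, debug symbols are not "overhead". Code compiled at maximum optimization (-O3) should not be adding debug symbols. That is a flag you set when you have no need for said symbols.

If you need debug symbols, you get them at the expense of losing compiler optimization. Once again, though, this is not "overhead"; it is just the absence of compiler optimization.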

score 2 · Accepted Answer

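Is the profiling code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly-language level, you will find out pretty quickly.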
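
If single-stepping is impractical, a rough alternative is to count function entries by hand. The sketch below is a hand-rolled stand-in, not gprof's actual mechanism: `COUNT_ENTRY`, `dot3`, `length_sq` and `entries_for` are hypothetical names, and the vector helpers are only assumed to resemble what a ray tracer calls in its inner loop.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared counter standing in for "one instrumentation call per function entry". */
static atomic_ulong entry_count;

#define COUNT_ENTRY() \
    atomic_fetch_add_explicit(&entry_count, 1, memory_order_relaxed)

/* Small vector helpers of the kind a ray tracer calls millions of times. */
double dot3(const double a[3], const double b[3])
{
    COUNT_ENTRY();
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

double length_sq(const double v[3])
{
    COUNT_ENTRY();
    return dot3(v, v);
}

static volatile double sink;   /* keeps the results observable */

/* Call length_sq() n times and report how many function entries occurred. */
unsigned long entries_for(long n)
{
    assert(n >= 0);
    const double v[3] = {1.0, 2.0, 3.0};
    double acc = 0.0;

    atomic_store(&entry_count, 0);
    for (long i = 0; i < n; i++)
        acc += length_sq(v);
    sink = acc;
    return atomic_load(&entry_count);
}
```

Here each `length_sq` call produces two entries (its own and the nested `dot3`), so the entry count grows twice as fast as the loop count. If the real program's count runs into the hundreds of millions, per-entry instrumentation is a plausible suspect.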

score 0 · Accepted Answer

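The execution time of multithreaded code is not always measured by gprof the way you would expect. You should time your code with another timer in addition to gprof to see the difference.

My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 cores + 8 cores), with size -s 50 and 20 iterations (-i), compiled with gcc 6.3.0 at -O3, I get:

Running 1 thread: ~3.7 without -pg and ~3.8 with it, but according to the gprof analysis the code ran for only 3.5.

Running 16 threads: ~0.6 without -pg and ~0.8 with it, but according to the gprof analysis the code ran for ~4.5 ...

The times with and without -pg were measured with gettimeofday, outside the parallel region (at the start and end of the main function).

So if you measure your application's time the same way, you may see the same speedup with and without -pg. It is just that the gprof measurement is wrong for parallel code. At least, that is the case for the LULESH OpenMP version.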
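
The measurement approach described above can be sketched as follows. This is an illustration under assumptions, not the code used for the LULESH numbers: `wall_seconds`, `cpu_seconds`, `time_region` and `busy_region` are hypothetical names. It also records process CPU time via `clock()`, because in a multithreaded run CPU time is summed over all threads and can exceed wall-clock time, which is one possible reason two timers disagree.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock time in seconds, taken with gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* CPU time consumed by the whole process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

typedef struct {
    double wall;   /* elapsed wall-clock seconds */
    double cpu;    /* CPU seconds used by the process */
} timing;

/* Time a region from the outside, as if wrapping the body of main(). */
timing time_region(void (*region)(void))
{
    assert(region != NULL);
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    region();
    timing t = { wall_seconds() - w0, cpu_seconds() - c0 };
    return t;
}

/* Stand-in workload; replace with the real (possibly parallel) region. */
void busy_region(void)
{
    volatile double acc = 0.0;
    for (long i = 0; i < 1000000; i++)
        acc += (double)i * 0.5;
}
```

Comparing `time_region(...).wall` between the -pg and non-pg builds gives the speedup a user actually experiences, independent of what gprof reports.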
