5

I tried profiling my application with OProfile recently. The data gathered is already very valuable to me, but I'm having difficulties with its precise interpretation. After running my app with oprofile set up and running, I generated the report and got:

root@se7xeon:src# opreport image:test -l -t 1
Overflow stats not available
CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
samples % symbol name
215522 84.9954 cci::Image::interpolate(unsigned char*, cci::Matrix const&) const
17998 7.0979 cci::Calc::diff(unsigned char const*, unsigned char const*)
13171 5.1942 cci::Image::getIRect(unsigned char*, int, int) const
5519 2.1765 cci::Image::getFRect(unsigned char*, double, double) const

Okay, so my interpolation function is responsible for 84% of the application's (too long) execution time. Seems a good idea to look into it then:

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const total: 215522   84.9954 */  
1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()  
4  0.0016 :{  
[...]  
:                col0 = static_cast<int>(point[idx].x);  
3  0.0012 :      col1 = col0+1;  
629  0.2498 :    row0 = static_cast<int>(point[idx].y);  
385  0.1529 :    row1 = row0+1;  
56214 22.3266 :  if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:                {  
:                        col0 = row0 = col1 = row1 = 0;  
:                }  

If I understand correctly, the if conditional is responsible for over 22% of the program's execution time. The opening brace and the function declaration seem to take time, is that supposed to correspond to the function call overhead ("push parameters on stack, jump, pop parameters" sequence)?

I changed some things in the source (related to a later bottleneck because I had no idea how to optimize an if), recompiled, ran through oprofile again (not forgetting opcontrol --reset). Now the annotated code looks like this in the same place:

6  0.0024 :     curPx = point[idx].x;  
628  0.2477 :   curPy = point[idx].y;  
410  0.1617 :   col0 = static_cast<int>(curPx);  
57910 22.8380 : col1 = col0+1;  
:               row0 = static_cast<int>(curPy);  
:               row1 = row0+1;  
:               if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:               {  
:                   col0 = row0 = col1 = row1 = 0;  
:               }  

This time the if takes basically no time at all (?), the most expensive instruction is "col1 = col0 + 1", and the whole time-taking block seems to have shifted upwards. How can this be? Can this be trusted at all to pinpoint bottlenecks in the source?

An another point of doubt for me is that when I set up opcontrol, I entered the traced event as GLOBAL_POWER_EVENTS, with the number of samples being 750k. In the output, the interpolation function seems to take 84%, but the number of samples recorded inside it is only a little bit above 200k. That isn't even 50% of the requested number. Am I to understand that the remaining ~500k samples was taken by applications not listed in the output (kernel, Xorg, etc.)?

4

2 回答 2

3

在分析优化代码时,您真的不能依赖准确的源代码行。编译器移动的东西太多了。

要获得准确的图片,您需要查看代码反汇编程序输出。

于 2010-10-27T15:01:54.187 回答
2

OProfile 可以(他们告诉我)在挂钟时间(而不是 CPU)上获取堆栈样本,它可以为您提供行级百分比。您正在寻找的是包含在大部分堆栈样本中的行。

在我完成手动调整代码之前,我不会打开编译器优化,因为它只是隐藏了一些东西。

当您说插值例程使用 84% 的时间时,就会引发一个问题。整个程序需要一些总时间,对吧?这需要 100% 的时间。如果你把程序的时间减半,或者加倍,它仍然需要 100% 的时间。84% 的插值是否太多取决于它是否做得比必要的多。

所以我建议你不要问例行程序的百分比是否太多。相反,您寻找花费大量时间的代码行并询问它们是否可以优化。看到不同?优化代码后,它可以大大减少整体运行时间,但它可能仍然是一个较小的总数的很大百分比。当没有什么占很大比例时,代码不是最佳的。当占很大比例的所有事情中没有一个可以改进时,代码是最佳的。

我不在乎只给出数字的东西。我想要的是洞察力。例如,如果该例程占 84% 的时间,那么如果您对 stack 进行 10 个样本,它将在其中 8.4 个样本上。确切的数字无关紧要。重要的是要了解它为什么在那里。真的有必要在里面这么多吗?这就是查看堆栈样本可以告诉你的。也许您实际上是根据需要多次进行插值?人们经常通过分析原因发现,他们试图加速的例程几乎不需要调用,也许根本不需要调用。我猜不出你的情况。只有检查程序状态的洞察力才能告诉您这一点。

于 2010-10-27T20:56:44.190 回答