c++ - Profiling help required

Question

I have a profiling problem. Imagine I have the following code...

int main()
{
    well_written_function();
    badly_written_function();
}
void well_written_function()
{
    for (a small number)
    {
        highly_optimised_subroutine();
    }
}
void badly_written_function()
{
    for (a wastefully and unnecessarily large number)
    {
        highly_optimised_subroutine();
    }
}
void highly_optimised_subroutine()
{
    // lots of code
}

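If I run this under vtune (or another profiler), it is very hard to spot that anything is wrong. All the hotspots will appear in the section marked "// lots of code", which is already optimised. badly_written_function() will not be highlighted in any way, even though it is the cause of all the trouble.

Is there some feature of vtune that will help me find the problem?

Is there some sort of mode whereby I can find the time taken by badly_written_function() and all of its sub-functions?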

score 1 · Accepted Answer

Rolling your own very simple profiler is not that hard. Insert into main():

int main()
{
    profileCpuUsage(1);                 // start timer #1
    well_written_function();
    profileCpuUsage(2);                 // stop timer #1, and start timer #2
    badly_written_function();
    profileCpuUsage(-1);                // print stats for timers #1 and #2
    return 0;
}

where:

#include <assert.h>
#include <stdio.h>

#define NUMBER(a) ((int)(sizeof(a) / sizeof((a)[0])))

double realElapsedTime(void);                 // see below for definition

void profileCpuUsage(int slice)
{
    static struct {
        int iterations;
        double elapsedTime;
    } slices[30];                             // 0 is a don't care slice
    static int i;                             // = previous slice (0 first time through)
    static double t;                          // = previous t1
    const double t1 = realElapsedTime();

    assert(slice < NUMBER(slices));
    slices[i].iterations  += 1;               // close out the slice that was running,
    slices[i].elapsedTime += t1 - t;          // so that -1 also stops the last timer
    t = t1;

    if (slice < 0) {                          // -1 = print
        for (slice = 1; slice < NUMBER(slices); slice++)
            if (slices[slice].iterations)     // skip slices that were never used
                printf("Slice %2d  Iterations %7d  Seconds %7.3f\n", slice,
                    slices[slice].iterations, slices[slice].elapsedTime);
        i = 0;                                // back to the don't care slice
    }
    else
        i = slice;
}

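Now admittedly, in your simple example, using this profileCpuUsage() does not add much benefit. And it has the disadvantage of requiring you to manually instrument your code by calling profileCpuUsage() at suitable locations.

But advantages include:

You can time any code fragment, not just procedures.
It is quick to add and delete, as you do a binary search to find and/or remove code hotspots.
It focuses only on the code you are interested in.
Portable!
KISS

One tricky non-portable thing is to define the function realElapsedTime() so that it provides enough granularity to get valid times. This generally works for me (using the Windows API under CYGWIN):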

#include <assert.h>
#include <windows.h>
double realElapsedTime(void)   // <-- granularity about 50 microsec on test machines
{
    static LARGE_INTEGER freq, start;
    LARGE_INTEGER count;
    if (!QueryPerformanceCounter(&count))
        assert(0 && "QueryPerformanceCounter");
    if (!freq.QuadPart) {      // one time initialization
        if (!QueryPerformanceFrequency(&freq))
            assert(0 && "QueryPerformanceFrequency");
        start = count;
    }
    return (double)(count.QuadPart - start.QuadPart) / freq.QuadPart;
}

For straight Unix, there is the common:

#include <sys/time.h>

double realElapsedTime(void)                      // returns 0 first time called
{
    static struct timeval t0;
    struct timeval tv;
    gettimeofday(&tv, 0);
    if (!t0.tv_sec)
        t0 = tv;
    return tv.tv_sec - t0.tv_sec + (tv.tv_usec - t0.tv_usec) / 1000000.;
}

realElapsedTime() gives wall-clock time, not process time, which is usually what I want.
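For what it's worth, on a C++11 compiler the same function can probably be written portably with std::chrono::steady_clock. This is a sketch of an alternative, not one of the versions I actually use; cpuElapsedTime() is a name made up here purely to contrast wall-clock time with process CPU time via std::clock().

```cpp
#include <chrono>
#include <ctime>

// Wall-clock seconds since the first call (returns 0 the first time called).
double realElapsedTime()
{
    using Clock = std::chrono::steady_clock;       // monotonic clock
    const Clock::time_point now = Clock::now();
    static const Clock::time_point start = now;    // one time initialization
    return std::chrono::duration<double>(now - start).count();
}

// Hypothetical helper: processor time reported by std::clock(), in seconds.
double cpuElapsedTime()
{
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}
```

The granularity of steady_clock depends on the platform, so it is worth checking that it is fine enough for the slices you are timing. What std::clock() counts also varies between implementations, so treat cpuElapsedTime() as a rough guide only.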

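There are also other, less portable methods that use RDTSC to achieve finer granularity; see for example http://en.wikipedia.org/wiki/Time_Stamp_Counter and its links, but I haven't tried these.

Edit: ravenspoint's very nice answer seems to be not too dissimilar from mine. His answer uses nice descriptive strings rather than just ugly numbers, which I was often frustrated with. But this can be fixed with only about a dozen extra lines (although this almost doubles the line count!).

Note that we want to avoid any use of malloc(), and I'm even a bit dubious about strcmp(). So the number of slices is never increased. Hash collisions are simply flagged rather than resolved: the human doing the profiling can fix this by manually bumping up the number of slices from 30, or by changing the description. Untested: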

#include <assert.h>
#include <string.h>                         // strlen, strcpy, strcmp

static unsigned gethash(const char *str)    // "djb2", for example
{
    unsigned c, hash = 5381;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;    // hash * 33 + c 
    return hash;
}

void profileCpuUsage(const char *description)
{
    static struct {
        int iterations;
        double elapsedTime;
        char description[20];               // added!
    } slices[30];

    if (!description) {
        // print stats, but using description, mostly unchanged...
    }
    else {
        const int slice = gethash(description) % NUMBER(slices);
        if (!slices[slice].description[0]) { // if new slice
            assert(strlen(description) < sizeof slices[slice].description);
            strcpy(slices[slice].description, description);
        }
        else if (!!strcmp(slices[slice].description, description)) {
            strcpy(slices[slice].description, "!!hash conflict!!");
        }
        // remainder unchanged...
    }
}
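Since the version above is untested and leaves two parts elided, here is a sketch of how a complete variant might look. It rests on assumptions that are not in the code above: timing uses std::chrono instead of realElapsedTime(), printing is split into a separate profileReport(), a null description just stops the current slice, and profileIterations() exists only so the sketch can be checked. All three extra names are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstring>

namespace {

struct Slice {
    int iterations;
    double elapsedTime;
    char description[20];
};

const unsigned kSlices = 30;    // fixed size: no malloc(), as discussed above
Slice slices[kSlices];          // zero-initialised static storage
int current = -1;               // index of the slice being timed, -1 = none
double startedAt = 0.0;         // time at which `current` was started

double now()                    // assumption: std::chrono as the time source
{
    using Clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(Clock::now().time_since_epoch()).count();
}

unsigned gethash(const char *str)   // "djb2"
{
    unsigned hash = 5381;
    while (unsigned char c = static_cast<unsigned char>(*str++))
        hash = hash * 33 + c;
    return hash;
}

} // namespace

// Stop the running slice (if any) and start timing `description`.
// A null description stops timing without starting a new slice.
void profileCpuUsage(const char *description)
{
    const double t = now();
    if (current >= 0) {
        slices[current].iterations  += 1;
        slices[current].elapsedTime += t - startedAt;
    }
    current = -1;
    if (!description)
        return;

    const int slice = static_cast<int>(gethash(description) % kSlices);
    Slice &s = slices[slice];
    if (!s.description[0]) {                                  // new slice
        assert(std::strlen(description) < sizeof s.description);
        std::strcpy(s.description, description);
    }
    else if (std::strcmp(s.description, description) != 0) {
        std::strcpy(s.description, "!!hash conflict!!");      // flagged, not resolved
    }
    current = slice;
    startedAt = t;
}

// Hypothetical: print every slice that was used at least once.
void profileReport()
{
    for (const Slice &s : slices)
        if (s.iterations)
            std::printf("%-20s Iterations %7d  Seconds %7.3f\n",
                        s.description, s.iterations, s.elapsedTime);
}

// Hypothetical accessor, present only so the sketch can be checked.
int profileIterations(const char *description)
{
    const Slice &s = slices[gethash(description) % kSlices];
    return std::strcmp(s.description, description) == 0 ? s.iterations : 0;
}
```

Descriptions have to be shorter than 20 characters, so something like profileCpuUsage("well") before well_written_function() rather than the full function name, followed by profileCpuUsage(0) and profileReport() at the end of main().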

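Another point is that you will typically want to disable this profiling for release versions; this also applies to ravenspoint's answer. This can be done with the trick of using an evil macro to define it away: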

#define profileCpuUsage(foo)                // = nothing

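If this is done, you will of course need to add parentheses to the definition, to stop the disabling macro from applying to it: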

void (profileCpuUsage)(const char *description)...
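To make the parentheses trick concrete, here is a small sketch. The counter realCalls and the function demo() are made up for illustration; the point is only that a function-like macro expands when its name is directly followed by an opening parenthesis, so wrapping the name in parentheses reaches the real function.

```cpp
#include <cstdio>

static int realCalls = 0;                     // counts calls that reach the function

#define profileCpuUsage(foo)                  // = nothing

// The parentheses keep the macro from expanding here, so this still
// defines a real function named profileCpuUsage.
void (profileCpuUsage)(const char *description)
{
    std::printf("profiling %s\n", description);
    ++realCalls;
}

// Hypothetical driver: one call through the macro, one around it.
int demo()
{
    profileCpuUsage("disabled");              // expands to an empty statement
    (profileCpuUsage)("enabled");             // bypasses the macro
    return realCalls;
}
```

In a real build the #define would presumably sit behind a conditional such as #ifdef NDEBUG, so that only release versions lose the profiling calls.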

score 1 · Accepted Answer


This is usually known as a "callgraph profile", and I'm fairly sure Visual Studio will do that.

Answered 2010-06-16T10:51:30.413

score 1 · Accepted Answer

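May I suggest my own open source profiler, raven::set::cRunWatch? It is designed for exactly this problem, and it works on Windows using Visual Studio 2008 Standard Edition, so you do not need to pay for the edition that has the profiler included.

I have taken your code, rearranged it slightly so that it compiles without forward declarations, and added the necessary calls to cRunWatch: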

// RunWatchDemo.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

void highly_optimised_subroutine()
{
    raven::set::cRunWatch runwatch("highly_optimised_subroutine");
    Sleep( 2 );
}


void badly_written_function()
{
    raven::set::cRunWatch runwatch("badly_written_function");
    for (int k = 1; k < 1000; k++ )
    {
        highly_optimised_subroutine();
    }
}

void well_written_function()
{
    raven::set::cRunWatch runwatch("well_written_function");
    for (int k = 1; k < 10; k++ )
    {
        highly_optimised_subroutine();
    }
}


int _tmain(int argc, _TCHAR* argv[])
{
    raven::set::cRunWatch::Start();

    well_written_function();
    badly_written_function();

    raven::set::cRunWatch::Report();

    return 0;
}

When run, this produces the output

raven::set::cRunWatch code timing profile
                    Scope   Calls       Mean (secs)     Total
highly_optimised_subroutine     1008    0.002921        2.944146
   badly_written_function        1      2.926662        2.926662
    well_written_function        1      0.026239        0.026239

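This shows badly_written_function to be a very close second in total time used, and therefore the culprit: almost all of the time spent in highly_optimised_subroutine comes from the calls it makes.

You can get cRunWatch from here. You will recognize the sample code in the User Guide :-)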
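For readers who just want the general idea of a scope-based timer, it can be sketched in standard C++ as below. To be clear, this is not the cRunWatch source; ScopeTimer, ScopeStats and their members are names invented for this illustration. The constructor records a start time and the destructor adds the elapsed time to a per-name total, so each total includes the time of everything called inside the scope.

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

struct ScopeStats {
    int calls = 0;
    double total = 0.0;          // seconds, including called functions
};

class ScopeTimer {
public:
    explicit ScopeTimer(const char *name)
        : name_(name), start_(Clock::now()) {}

    ~ScopeTimer()                // runs when the timed scope is left
    {
        ScopeStats &s = table()[name_];
        s.calls += 1;
        s.total += std::chrono::duration<double>(Clock::now() - start_).count();
    }

    ScopeTimer(const ScopeTimer &) = delete;
    ScopeTimer &operator=(const ScopeTimer &) = delete;

    static std::map<std::string, ScopeStats> &table()
    {
        static std::map<std::string, ScopeStats> t;
        return t;
    }

    static void report()
    {
        std::printf("%-30s %8s %12s %12s\n", "Scope", "Calls", "Mean (secs)", "Total");
        for (const auto &entry : table())
            std::printf("%-30s %8d %12.6f %12.6f\n", entry.first.c_str(),
                        entry.second.calls,
                        entry.second.total / entry.second.calls,
                        entry.second.total);
    }

private:
    using Clock = std::chrono::steady_clock;
    const char *name_;
    Clock::time_point start_;
};

// Hypothetical functions to time.
void inner() { ScopeTimer t("inner"); }

void outer(int n)
{
    ScopeTimer t("outer");
    for (int k = 0; k < n; k++)
        inner();
}
```

Calling outer(5) and then ScopeTimer::report() lists one call to outer and five to inner. This sketch is not thread-safe, and the map lookup in the destructor adds overhead of its own.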

score 0 · Accepted Answer

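Generally, this is a case where you want to look at a function's total time rather than its self time, to make sure the time you are looking at includes the time of called functions.

In VTune, I would recommend using the Top-down tab for that. Or, even better, if you are using the latest update, try the new experimental Caller-Callee view. You can get the details here - http://software.intel.com/en-us/forums/topic/376210. It gives a flat list of functions with their total times, so you can see which subtrees of your program are the most time-consuming.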
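To illustrate the difference between self time and total time, here is a toy sketch built around the functions from the question. It has nothing to do with how VTune works internally. The names Frame, Times, times, ticks and run() are made up, and a simulated clock (one tick per unit of work) replaces a real timer so that the numbers are reproducible.

```cpp
#include <map>
#include <string>

static long ticks = 0;                        // simulated clock

struct Times {
    long self = 0;                            // time spent in the function body itself
    long total = 0;                           // self time plus time in called functions
};
static std::map<std::string, Times> times;

class Frame {
public:
    explicit Frame(const char *name)
        : name_(name), start_(ticks), parent_(top) { top = this; }

    ~Frame()
    {
        const long total = ticks - start_;
        times[name_].total += total;
        times[name_].self  += total - children_;   // subtract time spent in callees
        if (parent_)
            parent_->children_ += total;           // charge this frame to its caller
        top = parent_;
    }

private:
    static Frame *top;                        // innermost active frame
    const char *name_;
    long start_;
    long children_ = 0;                       // total time of direct callees
    Frame *parent_;
};
Frame *Frame::top = nullptr;

void highly_optimised_subroutine()
{
    Frame f("highly_optimised_subroutine");
    ticks += 2;                               // pretend to do 2 ticks of work
}

void well_written_function()
{
    Frame f("well_written_function");
    for (int k = 0; k < 10; k++)
        highly_optimised_subroutine();
}

void badly_written_function()
{
    Frame f("badly_written_function");
    for (int k = 0; k < 1000; k++)
        highly_optimised_subroutine();
}

void run()
{
    well_written_function();
    badly_written_function();
}
```

After run(), badly_written_function has a self time of 0 but a total time of 2000 ticks, against 20 for well_written_function, while all 2020 ticks of self time land in highly_optimised_subroutine. A view sorted by self time hides the culprit; a view sorted by total time shows it. Recursive functions would need extra care, which this sketch ignores.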
