
I've been playing with high-resolution timers, and my first test uses rdtsc to measure printf. Below is my test program, followed by its output. What I noticed is that the first printf consistently takes about 25 times longer than the subsequent ones. Why is that?

#include <stdio.h>
#include <stdint.h>

// Sample code grabbed from wikipedia
__inline__ uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    // cpuid (with eax = 0) acts as a serializing instruction so rdtsc isn't reordered
    __asm__ __volatile__ (
            "xorl %%eax,%%eax \n        cpuid"
            ::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, const char *argv[])
{
    unsigned int i;
    uint64_t counter[10];
    uint64_t sum = 0;
    for (i = 0; i < 10; i++)
    {
        counter[i] = rdtsc();
        printf("Hello, world\n");
        counter[i] = rdtsc() - counter[i];
    }

    for (i = 0; i < 10; i++)
    {
        printf("counter[%u] = %llu\n", i, (unsigned long long)counter[i]);
        sum += counter[i];
    }
    printf("avg = %llu\n", (unsigned long long)(sum/10));
    return 0;
}

And the output:

Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
counter[0] = 108165
counter[1] = 6375
counter[2] = 4388
counter[3] = 4388
counter[4] = 4380
counter[5] = 4545
counter[6] = 4215
counter[7] = 4290
counter[8] = 4237
counter[9] = 4320
avg = 14930

(For reference, this was compiled with gcc on OSX.)


5 Answers


My guess is that on the first call to printf, the stdout resources aren't in the cache, and the call has to bring them in, which makes it slower. For all subsequent calls, the cache is already warm.

A second possible explanation is that, if this is on Linux (may also apply to OSX, I'm not sure), the program needs to set the stream orientation. (ASCII vs. UNICODE) This is done on the first call to a function using that stream and is static until the stream closes. I don't know what the overhead of setting this orientation is, but it's a one-time cost.

Please feel free to correct me if you think I'm completely wrong.

Answered 2011-09-02T14:35:56.380

Perhaps the first time, the code for printf isn't in the instruction cache, so it has to be loaded in from main memory. On subsequent runs, it's already in the cache.

Answered 2011-09-02T14:36:53.757

That's about 50 microseconds. Perhaps a caching issue? Too short to have anything to do with loading from the hard drive, but believable for pulling a large chunk of the C I/O library in from RAM.

Answered 2011-09-02T14:37:51.993

It can be some sort of lazy initialization.

Answered 2011-09-02T14:38:44.117

In both hardware and software design, there's an overriding principle: the execution speed of something that's done a million times matters far more than the execution speed of something that's done once. A corollary is that if something is done a million times, the time required to do it the first time matters far less than the time required for the other 999,999. One of the biggest reasons computers are so much faster today than 25 years ago is that designers have focused on making repeated operations faster, even when doing so might slow down one-off operations.

As a simple example from a hardware perspective, consider two approaches to memory design: (1) there is a single memory store, and every operation takes sixty nanoseconds to complete; (2) there are several levels of cache; fetching a word held in the first level takes one nanosecond; a word not there but held in the second level takes five; a word only in the third level takes ten; and one in none of them takes sixty. If all memory accesses were totally random, the first design would not only be simpler than the second, it would also perform better: most accesses would cause the CPU to waste time checking each cache level (up to sixteen nanoseconds) before fetching the data from main memory. On the other hand, if 80% of memory accesses are satisfied by the first cache level, 16% by the second, and 3% by the third, so that only one in a hundred has to go out to main memory, then the average time for those accesses is 2.5 ns. That's roughly twenty-four times as fast, on average, as the simpler memory system.

Even if an entire program is pre-loaded from disk, the first time a routine like "printf" is run, neither it nor any data it requires is likely to be in any level of cache. Consequently, slow memory accesses will be required the first time it's run. On the other hand, once the code and much of its required data have been cached, future executions will be much faster. If a repeated execution of a piece of code occurs while it is still in the fastest cache, the speed difference can easily be an order of magnitude. Optimizing for the fast case will in many cases make one-time execution of code much slower than it otherwise would be (to an even greater extent than suggested by the example above), but since processors spend much of their time running little pieces of code millions or billions of times, the speedups obtained in those situations far outweigh any slow-down in routines that run only once.

Answered 2011-09-02T14:58:48.537