performance - Measuring memcpy performance on x86-64

Question

I have 3 memory blocks.

char block_a[1600]; // Initialized with random chars
unsigned short block_b[1600]; // Initialized with random shorts 0 - 1599 with no duplication
char block_c[1600]; // Initialized with 0

I am performing the following copy operation on them:

for ( int i = 0; i < 1600; i++ ) {
    memcpy(&block_c[i], &block_a[block_b[i]], sizeof(block_a[0])); // Point # 1
}

Now I am trying to measure the CPU cycles, and the time in nanoseconds, taken by the operation marked Point # 1 above.

Measurement environment

1) Platform: Intel x86-64, Core i7
2) Linux kernel 3.8

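Measurement algorithm

0) The implementation is done as a kernel module so that I have full control and get precise data
1) Measure the overhead of the CPUID + MOV instructions that I use for serialization.
2) Disable preemption + interrupts to get exclusive access to the CPU
3) Call CPUID to make sure no out-of-order instructions are still in the pipeline at this point
4) Call RDTSC to get the initial TSC value and save it
5) Perform the operation I want to measure (mentioned above)
6) Call RDTSCP to get the final TSC value and save it
7) Call CPUID again to make sure nothing is executed out of order between our two TSC reads
8) Subtract the starting TSC value from the ending TSC value to get the CPU cycles taken by the operation
9) Subtract the overhead cycles taken by the 2 MOV instructions to get the final CPU cycles.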
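For readers who want to try steps 3 to 8 outside a kernel module, here is a minimal user-space sketch for GCC or Clang. The helper names `tsc_begin`, `tsc_end` and `time_byte_copy` are made up for illustration and are not from the question. It uses the `"=a"`/`"=d"` constraints to read RDTSC directly instead of the extra MOVs, and it does not disable preemption or interrupts, so readings will be noisier than in the kernel. The non-x86 branch is only a coarse placeholder so the file still builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
/* Steps 3-4: CPUID serializes the pipeline, then RDTSC reads the counter. */
static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "xor %%eax, %%eax\n\t"   /* CPUID leaf 0 */
        "cpuid\n\t"
        "rdtsc\n\t"
        : "=a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Steps 6-7: RDTSCP waits for earlier instructions, CPUID holds back later ones. */
static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "rdtscp\n\t"
        "mov %%eax, %0\n\t"      /* save the result before CPUID overwrites it */
        "mov %%edx, %1\n\t"
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"
        : "=r"(lo), "=r"(hi)
        :
        : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>
/* Placeholder for non-x86 builds: coarse clock ticks, not CPU cycles. */
static inline uint64_t tsc_begin(void) { return (uint64_t)clock(); }
static inline uint64_t tsc_end(void)   { return (uint64_t)clock(); }
#endif

/* Hypothetical helper: time one single-byte copy and return raw ticks (step 8).
 * The measurement overhead (step 9) is not subtracted here. */
static uint64_t time_byte_copy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    uint64_t start = tsc_begin();
    memcpy(dst, src, 1);
    uint64_t end = tsc_end();
    return end - start;          /* end minus start, never the reverse */
}
```

Note that compilers typically turn a one-byte `memcpy` into a single load and store, so the timed region may contain only a couple of instructions.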

Code

    ....
    ....
    preempt_disable(); /* Disable preemption to avoid scheduling */
    raw_local_irq_save(flags); /* Disable the hard interrupts */
    /* CPU is ours now */
    __asm__ volatile (
        "CPUID\n\t"
        "RDTSC\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t": "=r" (cycles_high_start), "=r" (cycles_low_start)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /*
     Measuring Point Start
    */
    memcpy(&shuffled_byte_array[idx], &random_byte_array[random_byte_seed[idx]], sizeof(random_byte_array[0]));
    /* 
    * Measuring Point End
    */
    __asm__ volatile (
        "RDTSCP\n\t"
        "MOV %%EDX, %0\n\t"
        "MOV %%EAX, %1\n\t"
        "CPUID\n\t": "=r" (cycles_high_end), "=r" (cycles_low_end)::
        "%rax", "%rbx", "%rcx", "%rdx"
    );

    /* Release CPU */
    raw_local_irq_restore(flags);
    preempt_enable();

    start = ( ((uint64_t)cycles_high_start << 32) | cycles_low_start);
    end   = ( ((uint64_t)cycles_high_end << 32) | cycles_low_end);
    if ( (end-start) >= overhead_cycles ) {
        total = ( (end-start) - overhead_cycles);
    } else {
        // Sample is below the measured overhead: keep the previous total
    }
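The arithmetic at the end of the code above can be isolated into two small helpers, sketched here with made-up names (`tsc_join`, `net_cycles`). Unlike the original, which keeps the previous total when a sample falls below the overhead, this sketch clamps such samples to zero. That is an assumption made for illustration, but it shows how a calibrated overhead close to the raw reading can produce rows of zeros.

```c
#include <assert.h>
#include <stdint.h>

/* Join the EDX (high) and EAX (low) halves of a TSC read into 64 bits. */
static inline uint64_t tsc_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* Cycles net of measurement overhead, clamped at zero when the raw
 * difference is smaller than the calibrated overhead. */
static inline uint64_t net_cycles(uint64_t start, uint64_t end, uint64_t overhead)
{
    assert(end >= start);        /* the TSC should not run backwards on one core */
    uint64_t raw = end - start;
    return raw >= overhead ? raw - overhead : 0;
}
```

For example, with a hypothetical overhead of 24 cycles, a raw difference of 30 yields 6, while a raw difference of 10 yields 0.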

Problem

The CPU cycle measurements I get do not seem realistic. Results for some samples are given below.

Cycles Time(NS)
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0000 0000
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0000 0000
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009

If I load my module again, I get the following results.

Cycles Time(NS)
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0006 0005
0006 0005
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0000 0000
0017 0014
0011 0009
0011 0009
0000 0000
0000 0000
0000 0000
0011 0009
0000 0000
0000 0000
0011 0009
0011 0009
0011 0009
0000 0000
0022 0018
0006 0005
0011 0009
0006 0005
0006 0005
0104 0086
0104 0086
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0017 0014
0017 0014
0022 0018
0022 0018
0022 0018
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0011 0009
0006 0005
0006 0005
0006 0005
0011 0009
0011 0009
0011 0009
0011 0009
0011 0009
0006 0005
0006 0005
0011 0009
0006 0005
0022 0018
0011 0009
0028 0023
0006 0005
0006 0005
0022 0018
0006 0005
0022 0018
0006 0005
0011 0009
0006 0005
0011 0009
0006 0005
0000 0000
0006 0005
0017 0014
0011 0009
0022 0018
0000 0000
0011 0009
0006 0005
0011 0009
0022 0018
0006 0005
0022 0018
0011 0009
0022 0018
0022 0018
0011 0009
0006 0005
0011 0009
0011 0009
0006 0005
0011 0009
0126 0105
0006 0005
0022 0018
0000 0000
0022 0018
0006 0005
0017 0014
0011 0009
0022 0018
0011 0009
0006 0005
0006 0005
0011 0009

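In the lists above you will notice that for many copy operations I get 0 CPU cycles. Very often I see fewer than 3 cycles.

What do you think is the reason for getting 0 or very few CPU cycles for the memcpy operation? And how many CPU cycles does memcpy normally take?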

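Update

I have tried the following changes, with these results:
1) If I copy a single byte with memcpy after a reboot, the cycle count is 0 - 8
2) If I copy the complete block with memcpy after a reboot, the cycle count is 0
3) Changing the BIOS to single core (the code already runs on a single core only, this was just to be sure) had no effect on the results
4) Disabling Intel SpeedStep in the BIOS had no effect, but once this issue is solved, Intel SpeedStep should stay disabled so that the CPU runs at its maximum frequency and the maximum possible CPU cycles are obtained.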
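The question does not show how the Time(NS) column is derived from the Cycles column. The pairs in the tables (6 and 5, 11 and 9, 104 and 86, 126 and 105) are consistent with a truncating division at roughly 1.2 cycles per nanosecond. The sketch below reproduces that conversion; the function name, the `tsc_khz` parameter and the 1.2 GHz rate are inferred for illustration and are not taken from the original module.

```c
#include <assert.h>
#include <stdint.h>

/* Convert TSC ticks to whole nanoseconds (truncating).
 * tsc_khz is the counter rate in kHz, so ns = cycles * 1e6 / tsc_khz.
 * The multiplication can overflow for very large cycle counts. */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
    assert(tsc_khz > 0);
    return cycles * 1000000u / tsc_khz;
}
```

If the rate used for the conversion differs from the rate the TSC actually ticks at, the nanosecond column is off by the same factor, which is one more reason to pin the frequency as item 4 suggests.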

score 0 · Accepted Answer

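It looks like caching was the cause of the misleading CPU cycle counts (the counts are not actually wrong, but cache behaviour has to be taken into account in this kind of measurement to get meaningful results). After making sure the given data is flushed from the cache, my results look reasonable. I added the following function to flush the cache. The clflush function is available in the kernel API and uses the x86 CLFLUSH instruction.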

static void flush_cache(char random_byte_array[], char shuffled_byte_array[])
{
    unsigned int idx = 0;
    for ( idx = 0; idx < (MEM_BLOCK_SIZE/64); idx++ ) {
        clflush(random_byte_array+(idx*64));
    }
    for ( idx = 0; idx < (MEM_BLOCK_SIZE/64); idx++ ) {
        clflush(shuffled_byte_array+(idx*64));
    }
}
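The loop above flushes `MEM_BLOCK_SIZE/64` lines starting at the array address, which covers the whole array only if it begins on a 64-byte boundary. As a hedged user-space variant, the sketch below rounds the range out to cache-line boundaries and uses the `_mm_clflush` intrinsic in place of the kernel's `clflush`. The name `flush_range`, the fixed 64-byte line size and the trailing `_mm_mfence` are assumptions of this sketch, not part of the answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>               /* _mm_clflush, _mm_mfence (SSE2) */
#define FLUSH_LINE(p) _mm_clflush(p)
#define FLUSH_FENCE() _mm_mfence()
#else
#define FLUSH_LINE(p) ((void)(p))    /* placeholder on non-x86 builds */
#define FLUSH_FENCE() ((void)0)
#endif

#define CACHE_LINE 64u               /* assumed line size, as in the code above */

/* Flush every cache line overlapping [buf, buf + len).
 * Returns the number of lines flushed. */
static size_t flush_range(const void *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first = (uintptr_t)buf & mask;               /* line holding the first byte */
    uintptr_t last  = ((uintptr_t)buf + len - 1) & mask;   /* line holding the last byte */
    size_t lines = 0;

    for (uintptr_t p = first; p <= last; p += CACHE_LINE, lines++)
        FLUSH_LINE((const void *)p);

    FLUSH_FENCE();                   /* complete the flushes before the timed region */
    return lines;
}
```

A 1600-byte block that starts on a line boundary spans 25 lines; shifted by one byte, it spans 26, and a fixed count of 25 would leave the last line cached.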

Results

memcpy of the complete 1600-byte memory block: CPU cycles = 216 - 260 (over multiple tests)

memcpy of single bytes of the 1600-byte block

Cycles Time (ns)
0159 0132
0000 0000
0000 0000
....
....
0049 0040
0049 0040
0049 0040
0000 0000
0000 0000
....
....

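The memcpy of the first element (index 0) takes about 140 - 160 cycles. For the next several elements it takes 0 - 10 cycles (I guess because the data has already been loaded into the cache), and after some more elements it again takes 140 - 160 cycles (probably a cache miss).

As long as the data is not in the cache I get sensible CPU cycle counts, but once the data is cached the counts are too small to measure reliably, so cache behaviour probably needs to be taken into account in the measurement as well.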
