performance - 无法解释内存带宽数

Question

我写了一个基准来计算内存带宽：

#include <benchmark/benchmark.h>

double sum_array(double* v, long n)
{
    double s = 0;
    for (long i =0 ; i < n; ++i) {
        s += v[i];
    }
    return s;
}


void BM_MemoryBandwidth(benchmark::State& state) {
    long n = state.range(0);
    double* v = (double*) malloc(state.range(0)*sizeof(double));

    for (auto _ : state) {
        benchmark::DoNotOptimize(sum_array(v, n));
    }
    free(v);
    state.SetComplexityN(state.range(0));
    state.SetBytesProcessed(int64_t(state.range(0))*int64_t(state.iterations())*sizeof(double));
}

BENCHMARK(BM_MemoryBandwidth)->RangeMultiplier(2)->Range(1<<5, 1<<23)->Complexity(benchmark::oN);


BENCHMARK_MAIN();

我编译

g++-9 -masm=intel -fverbose-asm -S -g -O3 -ffast-math -march=native --std=c++17 -I/usr/local/include memory_bandwidth.cpp

这会从 RAM 中产生一堆动作，然后是一些说是热的addpd指令perf，所以我进入生成的 asm 并删除它们，然后通过组装和链接

$ g++-9 -c memory_bandwidth.s -o memory_bandwidth.o
$ g++-9 memory_bandwidth.o -o memory_bandwidth.x -L/usr/local/lib -lbenchmark -lbenchmark_main -pthread -fPIC

此时，得到一个perf我期望的输出：将数据移动到xmm寄存器中，指针的增量，以及jmp循环结束时的 a：

一切都很好，直到这里。现在事情变得奇怪了：

我询问我的硬件内存带宽是多少：

$ sudo lshw -class memory
  *-memory
       description: System Memory
       physical id: 3c
       slot: System board or motherboard
       size: 16GiB
      *-bank:1
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          vendor: AMI
          physical id: 1
          slot: ChannelA-DIMM1
          size: 8GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)

所以我最多应该得到 8 个字节 * 2.4 GHz = 19.2 GB/秒。但相反，我得到 48 GB/秒：

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_MemoryBandwidth/32            6.43 ns         6.43 ns    108045392 bytes_per_second=37.0706G/s
BM_MemoryBandwidth/64            11.6 ns         11.6 ns     60101462 bytes_per_second=40.9842G/s
BM_MemoryBandwidth/128           21.4 ns         21.4 ns     32667394 bytes_per_second=44.5464G/s
BM_MemoryBandwidth/256           47.6 ns         47.6 ns     14712204 bytes_per_second=40.0884G/s
BM_MemoryBandwidth/512           86.9 ns         86.9 ns      8057225 bytes_per_second=43.9169G/s
BM_MemoryBandwidth/1024           165 ns          165 ns      4233063 bytes_per_second=46.1437G/s
BM_MemoryBandwidth/2048           322 ns          322 ns      2173012 bytes_per_second=47.356G/s
BM_MemoryBandwidth/4096           636 ns          636 ns      1099074 bytes_per_second=47.9781G/s
BM_MemoryBandwidth/8192          1264 ns         1264 ns       553898 bytes_per_second=48.3047G/s
BM_MemoryBandwidth/16384         2524 ns         2524 ns       277224 bytes_per_second=48.3688G/s
BM_MemoryBandwidth/32768         5035 ns         5035 ns       138843 bytes_per_second=48.4882G/s
BM_MemoryBandwidth/65536        10058 ns        10058 ns        69578 bytes_per_second=48.5455G/s
BM_MemoryBandwidth/131072       20103 ns        20102 ns        34832 bytes_per_second=48.5802G/s
BM_MemoryBandwidth/262144       40185 ns        40185 ns        17420 bytes_per_second=48.6035G/s
BM_MemoryBandwidth/524288       80351 ns        80347 ns         8708 bytes_per_second=48.6171G/s
BM_MemoryBandwidth/1048576     160855 ns       160851 ns         4353 bytes_per_second=48.5699G/s
BM_MemoryBandwidth/2097152     321657 ns       321643 ns         2177 bytes_per_second=48.5787G/s
BM_MemoryBandwidth/4194304     648490 ns       648454 ns         1005 bytes_per_second=48.1915G/s
BM_MemoryBandwidth/8388608    1307549 ns      1307485 ns          502 bytes_per_second=47.8017G/s
BM_MemoryBandwidth_BigO          0.16 N          0.16 N
BM_MemoryBandwidth_RMS              1 %             1 %

我对内存带宽有什么误解，导致我的计算错误超过 2 倍？

（另外，这是一个疯狂的工作流程，凭经验确定我有多少内存带宽。有更好的方法吗？）

sum_array删除添加说明后的完整 asm ：

_Z9sum_arrayPdl:
.LVL0:
.LFB3624:
    .file 1 "example_code/memory_bandwidth.cpp"
    .loc 1 5 1 view -0
    .cfi_startproc
    .loc 1 6 5 view .LVU1
    .loc 1 7 5 view .LVU2
.LBB1545:
# example_code/memory_bandwidth.cpp:7:     for (long i =0 ; i < n; ++i) {
    .loc 1 7 24 is_stmt 0 view .LVU3
    test    rsi, rsi    # n
    jle .L7 #,
    lea rax, -1[rsi]    # tmp105,
    cmp rax, 1  # tmp105,
    jbe .L8 #,
    mov rdx, rsi    # bnd.299, n
    shr rdx # bnd.299
    sal rdx, 4  # tmp107,
    mov rax, rdi    # ivtmp.311, v
    add rdx, rdi    # _44, v
    pxor    xmm0, xmm0  # vect_s_10.306
.LVL1:
    .p2align 4,,10
    .p2align 3
.L5:
    .loc 1 8 9 is_stmt 1 discriminator 2 view .LVU4
# example_code/memory_bandwidth.cpp:8:         s += v[i];
    .loc 1 8 11 is_stmt 0 discriminator 2 view .LVU5
    movupd  xmm2, XMMWORD PTR [rax] # tmp115, MEM[base: _24, offset: 0B]
    add rax, 16 # ivtmp.311,
    .loc 1 8 11 discriminator 2 view .LVU6
    cmp rax, rdx    # ivtmp.311, _44
    jne .L5 #,
    movapd  xmm1, xmm0  # tmp110, vect_s_10.306
    unpckhpd    xmm1, xmm0  # tmp110, vect_s_10.306
    mov rax, rsi    # tmp.301, n
    and rax, -2 # tmp.301,
    test    sil, 1  # n,
    je  .L10    #,
.L3:
.LVL2:
    .loc 1 8 9 is_stmt 1 view .LVU7
# example_code/memory_bandwidth.cpp:8:         s += v[i];
    .loc 1 8 11 is_stmt 0 view .LVU8
    addsd   xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_3
.LVL3:
# example_code/memory_bandwidth.cpp:7:     for (long i =0 ; i < n; ++i) {
    .loc 1 7 5 view .LVU9
    inc rax # i
.LVL4:
# example_code/memory_bandwidth.cpp:7:     for (long i =0 ; i < n; ++i) {
    .loc 1 7 24 view .LVU10
    cmp rsi, rax    # n, i
    jle .L1 #,
    .loc 1 8 9 is_stmt 1 view .LVU11
# example_code/memory_bandwidth.cpp:8:         s += v[i];
    .loc 1 8 11 is_stmt 0 view .LVU12
    addsd   xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_6
.LVL5:
    .loc 1 8 11 view .LVU13
    ret
.LVL6:
    .p2align 4,,10
    .p2align 3
.L7:
    .loc 1 8 11 view .LVU14
.LBE1545:
# example_code/memory_bandwidth.cpp:6:     double s = 0;
    .loc 1 6 12 view .LVU15
    pxor    xmm0, xmm0  # <retval>
    .loc 1 10 5 is_stmt 1 view .LVU16
.LVL7:
.L1:
# example_code/memory_bandwidth.cpp:11: }
    .loc 1 11 1 is_stmt 0 view .LVU17
    ret
    .p2align 4,,10
    .p2align 3
.L10:
    .loc 1 11 1 view .LVU18
    ret
.LVL8:
.L8:
.LBB1546:
# example_code/memory_bandwidth.cpp:7:     for (long i =0 ; i < n; ++i) {
    .loc 1 7 15 view .LVU19
    xor eax, eax    # tmp.301
.LBE1546:
# example_code/memory_bandwidth.cpp:6:     double s = 0;
    .loc 1 6 12 view .LVU20
    pxor    xmm0, xmm0  # <retval>
    jmp .L3 #
    .cfi_endproc
.LFE3624:
    .size   _Z9sum_arrayPdl, .-_Z9sum_arrayPdl
    .section    .text.startup,"ax",@progbits
    .p2align 4
    .globl  main
    .type   main, @function

的完整输出lshw -class memory：

  *-firmware
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: 1.90
       date: 10/21/2016
       size: 64KiB
       capacity: 15MiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
  *-memory
       description: System Memory
       physical id: 3c
       slot: System board or motherboard
       size: 16GiB
     *-bank:0
          description: [empty]
          physical id: 0
          slot: ChannelA-DIMM0
     *-bank:1
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          product: CMU16GX4M2A2400C16
          vendor: AMI
          physical id: 1
          serial: 00000000
          slot: ChannelA-DIMM1
          size: 8GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)
     *-bank:2
          description: [empty]
          physical id: 2
          slot: ChannelB-DIMM0
     *-bank:3
          description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
          product: CMU16GX4M2A2400C16
          vendor: AMI
          physical id: 3
          serial: 00000000
          slot: ChannelB-DIMM1
          size: 8GiB
          width: 64 bits
          clock: 2400MHz (0.4ns)

CPU在这里相关吗？那么这里的规格：

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Pentium(R) CPU G4400 @ 3.30GHz
Stepping:            3
CPU MHz:             3168.660
CPU max MHz:         3300.0000
CPU min MHz:         800.0000
BogoMIPS:            6624.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            3072K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d

clang 编译产生的数据更容易理解。随着向量变得比缓存大得多，性能单调下降，直到达到 19.8Gb/s：

这是基准输出：

score 1 · Accepted Answer

从您的硬件描述看来，您有两个 DIMM 插槽，它们被放置在两个通道中。这会在两个 DIMM 芯片之间交错内存，以便从两个芯片读取内存访问。（一种可能性是字节 0-7 在 DIMM1 中，字节 8-15 在 DIMM2 中，但这取决于硬件实现。）这会使内存带宽加倍，因为您访问的是两个硬件芯片而不是一个。

一些系统支持三个或四个通道，进一步增加了最大带宽。

performance - 无法解释内存带宽数

1 回答 1

Related

Reference