performance - Cache bandwidth per tick for modern CPUs

Question

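What is the speed of cache access on modern CPUs? How many bytes can be read from or written to memory per processor clock tick on Intel P4, Core 2, Core i7, and AMD?

Please answer with both theoretical numbers (width of the ld/sd unit and its throughput in uOPs/tick) and practical numbers (even memcpy speed tests or the STREAM benchmark), if any.

PS: This question is about the maximum rate of load/store instructions in assembler. There can be a theoretical load rate (every instruction per tick is the widest load), but the processor can sustain only part of that, which is the practical load limit.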

score 10 · Accepted Answer

For Nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.

128 bits = 16 bytes/clock read AND 128 bits = 16 bytes/clock write (can I combine a read and a write in a single cycle?)
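
To turn the per-clock figure into bytes per second, multiply by the core clock. A minimal sketch of that arithmetic follows; the function name `peak_gbytes_per_sec` and the 3.0 GHz clock mentioned below are illustrative assumptions, not numbers from the paper.

```c
#include <assert.h>

/* Theoretical peak for one port: (bits per clock / 8) bytes per clock,
 * times the clock in GHz (1e9 cycles/s), gives GB/s (1e9 bytes/s).
 * This is an upper bound only; it says nothing about sustained rates. */
double peak_gbytes_per_sec(unsigned port_bits, double clock_ghz)
{
    assert(port_bits % 8 == 0);          /* ports are whole bytes wide */
    return (port_bits / 8.0) * clock_ghz;
}
```

For example, with a hypothetical 3.0 GHz core, a 128-bit L1 read port would top out at 16 bytes/clock * 3.0 GHz = 48 GB/s per core, and the same again for the write port if both can be used in the same cycle.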

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can the L2 and L3 read and write ports be used in a single clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.

Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd. Both use a long linked-list walk to correctly measure modern out-of-order cores such as Intel Core i7.

CPU        L1     L2     L3 (cycles)   mem             link
Core 2      3     15     --           66 ns           http://www.anandtech.com/show/2542/5
Core i7-xxx 4     11     39          40c+67ns         http://www.anandtech.com/show/2542/5
Itanium     1     5-6    12-17       130-1000 (cycles)
Itanium2    2     6-10   20          35c+160ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8            12                 40-70c +64ns     http://www.anandtech.com/show/2139/3
Intel P4    2     19     43          200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3     20                 180 (cycles)     --//--
AthlonFX-51 3     13                 125 (cycles)     --//--
POWER4      4     12-20  ??          hundreds cycles  --//--
Haswell     4     11-12  36          36c+57ns         http://www.realworldtech.com/haswell-cpu/5/

A good source of latency data is the 7-cpu website, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about the lat_mem_rd program can be found in its man page or here on SO.
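
The linked-list walk mentioned above can be sketched roughly as follows. This is not the lat_mem_rd or latencytool source, only an illustration of the idea. The names `build_cycle`, `chase` and `ns_per_load` are made up for this example. It assumes a POSIX system for `clock_gettime`, and it uses `rand()` and unpadded 8-byte slots, which real tools would replace with a better generator and a cache-line stride.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[0..n-1] with a random permutation that forms ONE cycle
 * (Sattolo's algorithm), so a walk visits every slot before repeating
 * and the hardware prefetcher sees no regular stride. */
void build_cycle(size_t *next, size_t n)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` loads. Each load address depends on the
 * previous load's result, so an out-of-order core cannot overlap them. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}

/* Average nanoseconds per dependent load over a working set of n slots. */
double ns_per_load(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    assert(next != NULL && steps > 0);
    build_cycle(next, n);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile size_t sink = chase(next, 0, steps);  /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

Calling `ns_per_load` with growing working-set sizes (a few KB, then a few hundred KB, then many MB) should show the latency stepping up as the set falls out of L1, L2 and L3. Multiplying the result by the core clock in GHz converts nanoseconds to cycles.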

score 7 · Accepted Answer

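The widest read/write is the 128-bit (16-byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies, and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs, but you can usually issue 1 or 2 load instructions per clock.

I suspect there's a more specific question lurking here somewhere. What is it that you are actually trying to achieve? Do you just want to write the fastest possible memcpy?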
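
For a practical number to set against the theoretical one, a crude memcpy timing over a buffer small enough to stay in L1 could look like the sketch below. The function name `copy_bytes_per_ns` is hypothetical, `clock_gettime` assumes a POSIX system, and whether the library memcpy uses the widest loads and stores is up to the C library, so treat the result as a rough lower bound rather than the hardware limit.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy `bytes` from one buffer to another `reps` times and return the
 * average throughput in bytes per nanosecond (numerically equal to GB/s).
 * Returns 0.0 if the timer could not resolve the interval. */
double copy_bytes_per_ns(size_t bytes, size_t reps)
{
    assert(bytes > 0 && reps > 0);
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    assert(src != NULL && dst != NULL);
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);               /* touch both buffers before timing */

    struct timespec t0, t1;
    volatile unsigned char sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t r = 0; r < reps; r++) {
        src[0] = (unsigned char)r;       /* vary the input between copies */
        memcpy(dst, src, bytes);
        sink ^= dst[bytes - 1];          /* read the output to discourage
                                            the compiler from dropping it */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(src);
    free(dst);
    return ns > 0.0 ? (double)bytes * (double)reps / ns : 0.0;
}
```

Dividing the returned bytes/ns by the core clock in GHz gives bytes copied per clock, which can then be compared with the 16 bytes/clock read and 16 bytes/clock write ports quoted for Nehalem. Using two 16 KB buffers keeps the working set within a 32 KB L1 data cache, and larger sizes would measure L2, L3 or memory instead.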
