c++ - Speed of operations on unaligned data

Question

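As far as I know, a CPU performs best when a datum is aligned on a boundary equal to the datum's size. For example, if every int is 4 bytes in size, then the address of every int must be a multiple of 4 to keep the CPU happy; the same goes for 2-byte short data and 8-byte double data. For this reason, the new operator and the malloc function always return an address that is a multiple of 8 and, therefore, also a multiple of 4 and 2.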

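In my program, some time-critical algorithms that process large byte arrays can stride through the computation by converting each run of 4 contiguous bytes into an unsigned int, which makes the arithmetic much faster. However, the address of the byte array is not guaranteed to be a multiple of 4, because only part of the byte array may need to be processed.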

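As far as I know, Intel CPUs operate correctly on unaligned data, but at the expense of speed. If operations on unaligned data are slow enough, the algorithms in my program will need to be redesigned. In this connection I have two questions, the first of which is illustrated by the following code: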

// the address of array0 is a multiple of 4:
unsigned char* array0 = new unsigned char[4];
array0[0] = 0x00;
array0[1] = 0x11;
array0[2] = 0x22;
array0[3] = 0x33;
// the address of array1 is a multiple of 4 too:
unsigned char* array1 = new unsigned char[5];
array1[0] = 0x00;
array1[1] = 0x00;
array1[2] = 0x11;
array1[3] = 0x22;
array1[4] = 0x33;
// OP1: the address of the 1st operand is a multiple of 4,
// which is optimal for an unsigned int:
unsigned int anUInt0 = *((unsigned int*)array0) + 1234;
// OP2: the address of the 1st operand is not a multiple of 4:
unsigned int anUInt1 = *((unsigned int*)(array1 + 1)) + 1234;

So the questions are:

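How much slower is OP2 than OP1 on x86, x86-64, and Itanium processors (neglecting the cost of the type cast and the address increment)?
When writing cross-platform portable code, which kinds of processors should I be concerned about with regard to unaligned data access? (I already know about the RISC ones.)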

score 3 · Accepted Answer

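There are too many processors on the market to give a generic answer. The only thing that can be stated with certainty is that some processors cannot perform an unaligned access at all. This may or may not matter to you if your program is intended to run in a homogeneous environment, e.g. Windows.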
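For code that has to run on processors of both kinds, one common idiom is to copy the bytes into a properly aligned variable instead of dereferencing a cast pointer. The sketch below is only an illustration: the helper names load_u32_unaligned and load_u32_le are hypothetical and do not come from the question. On compilers that target x86, the memcpy form is typically compiled down to a single load instruction, but that is an optimizer behavior, not a guarantee.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the question): reads 4 bytes starting at p
// into a std::uint32_t without requiring p to be 4-byte aligned.
// The result uses the host's byte order, like the pointer cast in the question.
inline std::uint32_t load_u32_unaligned(const unsigned char* p) {
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);  // well-defined for any alignment
    return value;
}

// Hypothetical byte-order-independent variant: always interprets the
// 4 bytes as a little-endian value, whatever the host's byte order is.
inline std::uint32_t load_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

With these helpers, OP2 from the question could be written as load_u32_unaligned(array1 + 1) + 1234, which avoids the misaligned pointer dereference on processors that would fault on it.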

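In a modern high-speed processor, the speed of an unaligned access may be affected more by its cache alignment than by its address alignment. On today's x86 processors the cache line size is 64 bytes.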
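To make the cache-alignment point concrete, here is a minimal sketch of a check for whether a given access straddles two cache lines. The name crosses_cache_line and the constant kCacheLineSize are hypothetical, and the 64-byte value is an assumption taken from the x86 figure above; other processors may use a different line size.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size: 64 bytes, the figure quoted above for x86.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical helper: returns true if an access of `size` bytes starting
// at address p touches more than one cache line.
inline bool crosses_cache_line(const void* p, std::size_t size) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t offset = addr % kCacheLineSize;  // position inside the line
    return offset + size > kCacheLineSize;
}
```

Under this assumption, a 4-byte read at an address that is 1 past a multiple of 4 usually stays inside a single line; only reads that start 61, 62, or 63 bytes into a line spill into the next one.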

There is a Wikipedia article that might provide some general guidance: http://en.wikipedia.org/wiki/Data_structure_alignment
