arrays - Why does order of array declaration affect performance so much?

Question

First, in tuning a frequency analysis function using the Accelerate framework, the absolute system time has consistently been 225ms per iteration. Then last night I changed the order of which two of the arrays were declared and suddenly it went down to 202ms. A 10% increase by just changing the declaration order seems insane. Can someone explain to me why the compiler (which is set to optimize) is not already finding this solution?

Additional info: Before the loop there is some setup of the arrays used in the loop consisting of converting them from integer to float arrays (for Accelerate) and then taking sin and cos of the time array (16 lines long). All of the float arrays (8 arrays x 1000 elements) are declared first in the function (after a sanity check of the parameters). They are always declared the same size (by a constant), because otherwise performance suffered for little shrinkage of the footprint. I tested making them globals, but I think the compiler already figured that out as there is no performance change. The loop is 25 lines long.

---Additions---

Yes, "-Os" is the flag. (default in Xcode anyways: Fastest, Smallest)

(below is from memory - don't try to compile it, cause I didn't put in things like stride (which is 1), etc. However, all of the Accelerate calls are there)

passed parameters: inttimearray, intamparray, length, scale1, scale2, amp

float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize];    //these two make the most change
float timearray[maxsize];    //these two make the most change

vDSP_vfltu32(inttimearray,timearray,length); //convert to float array
vDSP_vflt16(intamparray,amparray,length);    //convert to float array

vDSP_vsmul(timearray,scale1,temparray,length);    //scale time and store in temp
vvcosf(temparray,trigarray3,length);     //cos of temparray
vvsinf(temparray,trigarray4,length);     //sin of temparray
vDSP_vneg(trigarray4,trigarray5,length); //negative of trigarray4

vDSP_vsmul(timearray,scale2,temparray,length); //scale time and store in temp
vvcosf(temparray,trigarray1,length);           //cos of temparray
vvsinf(temprray,trigarray2,length);            //sin of temparray

float ysum;
vDSP_sve(amparray,ysum,length);    //sum of amparray

float csum, ssum, ccsum, sssum, cssum, ycsum, yssum;

for (i = 0; i<max; i++) {

    vDSP_sve(trigarray1,csum,length);    //sum of trigarray1
    vDSP_sve(trigarray2,ssum,length);    //sum of trigarray2

    vDSP_svesq(trigarray1,ccsum,length); //sum of trigarray1^2
    vDSP_svesq(trigarray2,sssum,length); //sum of trigarray2^2

    vDSP_vmul(trigarray1,trigarray2,temparray,length); //temp = trig1*trig2
    vDSP_sve(temparray,cssum,length);                  //sum of temp array
    // 2 more sets of the above 2 lines, for the 2 remaining sums

    amp[i] = (arithmetic of sums);

    //trig identity to increase the sin/cos by a delta frequency
    //vmma is a*b+c*d=result
    vDSP_vmma (trigarray1,trigarray3,trigarray2,trigarray4,temparray,length);
    vDSP_vmma (trigarray2,trigarray3,trigarray1,trigarray5,trigarray2,length);
    memcpy(trigarray1,temparray,length*sizeof(float));
}

---Current Solution---

I've made some changes as follows:

The arrays are all declared aligned, and zero'd out (I'll explain next) and maxsize is now a multiple of 16

__attribute__ ((align (16))) float timearray[maxsize] = {0};

I've zero'd out all of the arrays because now, when the length is less than maxsize, I round the length up to the nearest multiple of 16 so that all of the looped functions operate on widths divisible by 16, without affecting the sums.

The benefits are:

Slight performance boost
The speed is nearly constant regardless of order of array declaration (which is now done right before they are needed, instead of all in a big block)
The speed is also nearly constant for any 16-wide length (i.e. 241 to 256, or 225 to 240...), whereas before, if the length went from 256 to 255, the function would take a 3+% performance hit.

In the future (possibly with this code, as analysis requirements are still in flux), I realize I'll need to take into consideration stack usage more, and alignment/chunks of vectors. Unfortunately, for this code, I can't make these arrays static or globals as this function can be called by more than one object at a time.

score 3 · Accepted Answer

我首先怀疑的是对齐。您可能想尝试：

__attribute__ ((align (16))) float ...[maxsize];

或者确保它maxsize是 16 的倍数。如果在一种配置中对齐而在另一种配置中没有对齐，那肯定会导致 10% 的命中率。矢量运算可能对此非常敏感。

您可能遇到的下一个主要问题是一个巨大的堆栈（假设maxsize相当大）。ARM 处理小于 4k 的数字比处理大于 4k 的数字更有效（因为它只能处理 12 位立即数）。因此，根据编译器对其进行优化的方式，将 amparray 向下推入堆栈可能会导致更复杂的数学运算来访问它。

当小事导致性能大的变化时，我总是建议拉起程序集（产品>生成输出>程序集）并查看编译器输出的变化。我还强烈推荐ARM Assembly 的旋风之旅，让您开始了解您正在查看的内容。（确保将输出设置为“用于存档”，以便看到优化的结果。）

您还应该做更多的事情：

尝试将此例程重写为简单的 C 而不是使用 Accelerate。是的，我知道 Accelerate 总是更快，但事实并非如此。所有这些函数调用都非常昂贵，并且编译器通常可以更好地向量化简单的乘法和加法，而 Accelerate 可以根据我的经验。如果您的步幅为 1，您的向量不是很大，并且您使用的是 1-2 核设备（如 iPad），则尤其如此。当您拥有处理跨步的代码时（如果您不需要跨步），它比您手动编写的代码更复杂（更慢）。根据我的经验，Accelerate 似乎非常擅长斜坡和超越（例如大表的余弦），但在简单的向量和矩阵数学方面几乎没有那么好。
如果这段代码对你来说真的很重要，我发现手写程序集肯定会超过编译器。我什至不擅长 ARM 汇编程序，而且我已经能够在简单的矩阵数学上击败编译器 2 倍（并且编译器粉碎了 Accelerate）。我在这里特别谈论你的循环，它似乎只是在做加法和乘法。手写汇编当然是一件痛苦的事，然后你必须为汇编程序维护一个 C 版本，但是当它真的很重要时，它真的很快。

score 2 · Accepted Answer

如果没有可运行的代码，可能很难确定存在哪些性能障碍。

我将使用这个答案来提出一些可能性，并对这个问题的其他答案和评论中提出的一些问题发表评论。

首先，有 7 个每个 4 kB 的数组，您使用的几乎是 L1 缓存的大小。根据堆栈使用的其他数量等，您可能会破坏缓存。这可以解释为什么减小块大小会提高性能：对于更小的块，每次迭代使用的内存更少，并且所有内存都适合缓存，因此在迭代期间很少或没有丢弃。处理这种缓存抖动的另一种方法是条带挖掘：不是在整个长度上执行 sve、svesq、vmul、vmma 和 memcpy，而是在长度的一部分（例如一半）上执行所有这些，然后执行所有这些都放在另一部分，并根据需要重复，直到它们被完全处理。

trigarray5 仅存在以便第二个 vmma 否定 trigarray4。消除 trigarray5 并使用 trigarray4 调用 vmmsb（减去而不是添加）。这也减少了内存使用。

即使使用的数据少于填充缓存，缓存几何有时也会导致抖动。缓存被划分为多个集合，每个内存地址必须映射到一个特定的集合。例如，一个 32,768 字节的高速缓存可能有 1024 条“线”，每条 32 字节，但它可以被组织成 256 组，每组 4 条线。任何一个内存地址都映射到一组，并且它必须使用该组中的四行之一。如果您有五个数组以相同的地址开始，以该几何为模（或基本上重叠），那么它们将争夺每组中的四行，并在它们进行时相互排斥。当数组在内存中连续分配时可以避免这种情况，就像编译器通常在一个接一个地简单地声明数组时所做的那样，但可能会出现复杂情况。没有可运行的代码，很难确定。

将数组对齐到 16 字节的倍数很好，可能会有所帮助。在某些情况下，它有很大帮助。如果可能，许多 vDSP 例程会处理一些初始元素以达到良好对齐的边界，然后使用快速 SIMD 代码直到接近数组末尾，此时可能需要单独处理另外一些元素。然而，这并不总是可能的，因为当一个对多个向量进行操作的例程被传递给具有不同对齐方式的向量时。（对齐一个指针的处理元素会使其他指针未对齐。）除了添加 align 属性，对齐数组的另一种方法是使用标准内存分配例程分配它们，例如 malloc。在 Mac OS X 和 iOS 上，malloc 返回 16 字节对齐的地址。

堆栈大小和 ARM 具有有限立即值的事实可能不是问题，向量地址的计算应该是代码中计算的一个微不足道的部分。（此外，ARM 有一些有趣的灵活立即数，而不仅仅是 12 位整数。）

实际函数调用和返回本身的成本可能微不足道。Apple 提供的编译器并没有“比 Accelerate 更好地向量化简单的乘法和加法”，而且函数调用也不是“非常昂贵”。</p>

你省略了步伐。如果它们不是一个，您可能会通过重写代码获得很多好处，以便在调用 vDSP 例程时数据具有单位跨度。

分支预测在这里可能不是问题。

可运行代码将极大地帮助诊断您的性能问题。

score 0 · Accepted Answer

首先：不幸的是，这种对数据放置的敏感性很常见。我们中的一些人编写了尝试多种不同布局的代码

诸如此类的性能损失的常见罪魁祸首是：

分支错误预测
缓存效果
- 容量未命中（只是简单的太多数据，例如 1MB 的数据无法放入 32KB 缓存中）
- 缓存冲突（例如，在 4 路关联 32KB 缓存中，超过 4 个相同的模 8K 地址）
DRAM效果
- DRAM 页面缺失

我无法解析您所说的内容：MAXSIZE 是什么？你说 7*4KB ......但是你有 8 个数组，所以我怀疑你是说 MAXSIZE=1024。你是说MAXSIZE是7*1024？（* 4B / 浮动？）

无论如何：如果每个单独阵列的 MAXSIZE 大约为 28KB，那么对于许多系统来说，您的缓存大小已经接近。在这种情况下，我会怀疑 DRAM 页面效应 - 我会怀疑性能良好的安排将访问次数最多的阵列放在单独的 DRAM 页面中。

你没有说哪个表现更好，但我猜：

float amparray[maxsize];    //these two make the most change
float timearray[maxsize];    //these two make the most change

观察您的代码， timearray 似乎是最容易访问的。如果 timearray 秒的性能更好，并且我对 MAXSIZE 的猜测是正确的，那么我敢打赌这是 DRAM 页面效应。

快速解释：DRAM 具有页和存储体的概念。不要与操作系统页面混淆。每个 DRAM 芯片，因此每个 DIMM，都有 4 或 8 个内部存储体。每个银行可以有一个打开的页面。如果您从同一页面、同一银行访问数据，则速度最快。如果您从不同银行中已打开的页面访问数据，速度快，但比同一银行的同一页面慢如果您需要同一银行中的不同页面，真的很慢。如果您有写回缓存，则写回几乎是随机发生的，因此您可能会获得非常糟糕的页面行为。

但是，如果我猜错了 MAXSIZE，那么可能是缓存效应。

RED FLAG：你说“我没有加入 stride 之类的东西”。步幅因使数据在缓存中表现不佳而臭名昭著。缓存通常是设置关联的，这意味着它们具有我所说的“共振” - 与缓存的共振模数相同的地址将映射到同一组。如果你有比联想更多的东西，你就会痛击。

将共振计算为缓存大小除以关联性。例如，如果您有一个 32K 的 4 路关联缓存，那么您的共振是 8K。

无论如何...如果您只是大步访问事物，那么数组放置可能很重要。例如，假设您的步幅为 16。即访问元素 0、16、32、48 等。如果 MAXSIZE 为 7*1024，正如我上面猜到的，那么元素

float trigarray1[maxsize];
float trigarray2[maxsize];
float trigarray3[maxsize];
float trigarray4[maxsize];
float trigarray5[maxsize];
float temparray[maxsize];
float amparray[maxsize];    //these two make the most change
float timearray[maxsize];    //these two make the most change

那么以下数组将发生冲突 - 它们的跨步访问模式将映射到相同的集合：

trigarray1, trigarray5
trigarray2, temparray
trigarray3, amparray
trigarray4, timearray,

如果你交换 amparray 和 timearray，那么

   trigarray3 will conflict with timearray
and
   trigarray4 with amparray

trigarray4 和 timarray 似乎是最常用的，所以我猜，如果你的步幅像 0、16、32、348，或者实际上任何以 0 开头的步幅，那么这两个数组冲突就是你的问题。

但是，您可能有不同的步幅模式：0、16、32、48 ... 在一个数组中，而 1,17,33,... 在另一个数组中。那么不同的数组对就会发生冲突。

--

我没有足够的信息在这里诊断您的问题。

如果您可以使用良好的性能工具，您也许可以自己做。

例如，在 Intel 处理器上，您可以记录我所说的缓存未命中配置文件，记录理想的物理内存地址，计算它们在缓存中映射到的集合，并生成直方图。如果您看到尖峰，那可能是个问题。同样，您可以生成 DRAM 页面未命中或存储库未命中配置文件。我只提到英特尔是因为我设计了一些硬件来实现这种性能测量。同样的事情可能应该在 ARM 上可用（如果没有，也许我可以获得丰富的销售工具来做到这一点...... :-)）。

如果这些是问题，你怎么能解决它？

好吧，正如您在上面解释的那样，通过尝试不同的展示位置。这可以帮助解决跨步（缓存集冲突）和 DRAM 页面问题。

如果步幅有问题，您可以尝试使数组大小有点不同 - MAXSIZE + 4、MAXSIZE 8 等。这可以有效地抵消步幅。（在超级计算机代码中看到大小为 255 或 257 的数组很常见，这与偏移跨步访问模式的原因相同，以免发生冲突。）

score 0 · Accepted Answer

这里只是猜测。结盟？

这些库应该使用 SIMD 指令，即使在某些不需要对齐的情况下，这些指令的时间也取决于对齐。

缓存行对齐也可能起作用，也可能不起作用。

这些数组是在堆栈上分配的，这意味着除了 sizeof(float) 内在保证和第一个对象的体系结构保证之外，您几乎无法控制该数据的对齐方式（如果第一个局部变量实际上保证了 64 位对齐）您以 64 位模式编译）。

您可以尝试通过打印/记录地址来验证数据对齐方式。并通过定义一个结构来保存数据并使用 malloc 为其获取内存（获得比您需要的更多的内存，以便您可以将结构放置在内存块中的不同偏移量处，特别是如果您想玩缓存线对齐）。

score 0 · Accepted Answer

可能与分支预测以及数组中的元素有关。

请参阅这篇文章以获取令人敬畏的参考。您的帖子可能与此帖子类似，通过以一个顺序声明您的数组，数据显示为“排序”，但在另一个顺序中，它不是。

为什么处理排序数组比处理未排序数组更快？

arrays - Why does order of array declaration affect performance so much?

5 回答 5

Related

Reference