我一直在根据一个粗略的经验法则进行操作,如果您有足够的数据进行操作,Q-form ASIMD 指令与 D-form 一样好或更好。因此,在阅读Cortex-A72 软件优化指南的第 3.15 节时,我很惊讶地看到FADDP
,D 型的吞吐量为 2,Q 型的吞吐量为 2/3(作为参考,延迟分别为 4 和 7)。对于 D 和 Q 形式具有不同性能的所有其他指令在最坏的情况下只有很小的延迟差异(例如 3 对 4 FRINTX
),并且具有相同或正好一半的吞吐量。
有什么特别之处FADDP
在于它的吞吐量对于 Q-form 减少了三分之一,并且(如果你有前端带宽)你真的可以通过用两个 D-form 指令替换 Q-form 来增加吞吐量吗?
测试和基准测试:
我编写了几个 c++ 函数来尝试以两种方式锻炼 cortex-a72:
void q(float *i) {
auto x = vld1q_f32_x4(i);
auto y = vld1q_f32_x4(i + 16);
for (int i = 0; i < 8192; ++i) {
x.val[0] = vpaddq_f32(x.val[0], y.val[0]);
x.val[1] = vpaddq_f32(x.val[1], y.val[1]);
x.val[2] = vpaddq_f32(x.val[2], y.val[2]);
x.val[3] = vpaddq_f32(x.val[3], y.val[3]);
y.val[0] = vpaddq_f32(x.val[0], y.val[0]);
y.val[1] = vpaddq_f32(x.val[1], y.val[1]);
y.val[2] = vpaddq_f32(x.val[2], y.val[2]);
y.val[3] = vpaddq_f32(x.val[3], y.val[3]);
}
vst1q_f32_x4(i, x);
}
void d(float *i) {
auto x0 = vld1_f32_x4(i);
auto x1 = vld1_f32_x4(i + 8);
auto y0 = vld1_f32_x4(i + 16);
auto y1 = vld1_f32_x4(i + 24);
for (int i = 0; i < 8192; ++i) {
x0.val[0] = vpadd_f32(x0.val[0], x0.val[1]);
x0.val[1] = vpadd_f32(y0.val[0], y0.val[1]);
x0.val[2] = vpadd_f32(x0.val[2], x0.val[3]);
x0.val[3] = vpadd_f32(y0.val[2], y0.val[3]);
x1.val[0] = vpadd_f32(x1.val[0], x1.val[1]);
x1.val[1] = vpadd_f32(y1.val[0], y1.val[1]);
x1.val[2] = vpadd_f32(x1.val[2], x1.val[3]);
x1.val[3] = vpadd_f32(y1.val[2], y1.val[3]);
y0.val[0] = vpadd_f32(x0.val[0], x0.val[1]);
y0.val[1] = vpadd_f32(y0.val[0], y0.val[1]);
y0.val[2] = vpadd_f32(x0.val[2], x0.val[3]);
y0.val[3] = vpadd_f32(y0.val[2], y0.val[3]);
y1.val[0] = vpadd_f32(x1.val[0], x1.val[1]);
y1.val[1] = vpadd_f32(y1.val[0], y1.val[1]);
y1.val[2] = vpadd_f32(x1.val[2], x1.val[3]);
y1.val[3] = vpadd_f32(y1.val[2], y1.val[3]);
}
vst1_f32_x4(i, x0);
vst1_f32_x4(i + 8, x1);
}
当使用 clang 和 -O3 编译时,它们会产生以下结果:
0000000000400a84 <_Z1qPf>:
400a84: aa0003e8 mov x8, x0
400a88: 4cdf2900 ld1 {v0.4s-v3.4s}, [x8], #64
400a8c: 4c402904 ld1 {v4.4s-v7.4s}, [x8]
400a90: 52840008 mov w8, #0x2000 // #8192
400a94: 4ea21c54 mov v20.16b, v2.16b
400a98: 4ea11c35 mov v21.16b, v1.16b
400a9c: 4ea01c10 mov v16.16b, v0.16b
400aa0: 4ea61cd6 mov v22.16b, v6.16b
400aa4: 4ea51cb7 mov v23.16b, v5.16b
400aa8: 4ea41c98 mov v24.16b, v4.16b
400aac: 6e38d610 faddp v16.4s, v16.4s, v24.4s
400ab0: 6e37d6b5 faddp v21.4s, v21.4s, v23.4s
400ab4: 6e36d694 faddp v20.4s, v20.4s, v22.4s
400ab8: 6e27d463 faddp v3.4s, v3.4s, v7.4s
400abc: 71000508 subs w8, w8, #0x1
400ac0: 6e38d618 faddp v24.4s, v16.4s, v24.4s
400ac4: 6e37d6b7 faddp v23.4s, v21.4s, v23.4s
400ac8: 6e36d696 faddp v22.4s, v20.4s, v22.4s
400acc: 6e27d467 faddp v7.4s, v3.4s, v7.4s
400ad0: 54fffee1 b.ne 400aac <_Z1qPf+0x28> // b.any
400ad4: 4eb51eb1 mov v17.16b, v21.16b
400ad8: 4eb41e92 mov v18.16b, v20.16b
400adc: 4ea31c73 mov v19.16b, v3.16b
400ae0: 4c002810 st1 {v16.4s-v19.4s}, [x0]
400ae4: d65f03c0 ret
0000000000400ae8 <_Z1dPf>:
400ae8: fc1c0fee str d14, [sp, #-64]!
400aec: 6d0133ed stp d13, d12, [sp, #16]
400af0: 6d022beb stp d11, d10, [sp, #32]
400af4: 6d0323e9 stp d9, d8, [sp, #48]
400af8: aa0003e8 mov x8, x0
400afc: 0cdf2900 ld1 {v0.2s-v3.2s}, [x8], #32
400b00: 91010009 add x9, x0, #0x40
400b04: 0c402930 ld1 {v16.2s-v19.2s}, [x9]
400b08: 91018009 add x9, x0, #0x60
400b0c: 0c402904 ld1 {v4.2s-v7.2s}, [x8]
400b10: 0c402934 ld1 {v20.2s-v23.2s}, [x9]
400b14: 52840009 mov w9, #0x2000 // #8192
400b18: 4ea11c29 mov v9.16b, v1.16b
400b1c: 4ea01c18 mov v24.16b, v0.16b
400b20: 4ea51ca8 mov v8.16b, v5.16b
400b24: 4ea41c9c mov v28.16b, v4.16b
400b28: 4eb21e4b mov v11.16b, v18.16b
400b2c: 4eb11e2a mov v10.16b, v17.16b
400b30: 4eb01e0e mov v14.16b, v16.16b
400b34: 4eb61ecd mov v13.16b, v22.16b
400b38: 4eb71eec mov v12.16b, v23.16b
400b3c: 2e23d442 faddp v2.2s, v2.2s, v3.2s
400b40: 2e27d4c6 faddp v6.2s, v6.2s, v7.2s
400b44: 2e29d718 faddp v24.2s, v24.2s, v9.2s
400b48: 2e2ad5c9 faddp v9.2s, v14.2s, v10.2s
400b4c: 2e28d79c faddp v28.2s, v28.2s, v8.2s
400b50: 2e35d688 faddp v8.2s, v20.2s, v21.2s
400b54: 2e33d563 faddp v3.2s, v11.2s, v19.2s
400b58: 2e2cd5a7 faddp v7.2s, v13.2s, v12.2s
400b5c: 2e29d70e faddp v14.2s, v24.2s, v9.2s
400b60: 2e28d794 faddp v20.2s, v28.2s, v8.2s
400b64: 2e23d44b faddp v11.2s, v2.2s, v3.2s
400b68: 2e27d4cd faddp v13.2s, v6.2s, v7.2s
400b6c: 71000529 subs w9, w9, #0x1
400b70: 2e2ad5ca faddp v10.2s, v14.2s, v10.2s
400b74: 2e35d695 faddp v21.2s, v20.2s, v21.2s
400b78: 2e33d573 faddp v19.2s, v11.2s, v19.2s
400b7c: 2e2cd5ac faddp v12.2s, v13.2s, v12.2s
400b80: 54fffde1 b.ne 400b3c <_Z1dPf+0x54> // b.any
400b84: 4ea91d39 mov v25.16b, v9.16b
400b88: 4ea81d1d mov v29.16b, v8.16b
400b8c: 4ea21c5a mov v26.16b, v2.16b
400b90: 4ea61cde mov v30.16b, v6.16b
400b94: 4ea31c7b mov v27.16b, v3.16b
400b98: 4ea71cff mov v31.16b, v7.16b
400b9c: 0c002818 st1 {v24.2s-v27.2s}, [x0]
400ba0: 0c00291c st1 {v28.2s-v31.2s}, [x8]
400ba4: 6d4323e9 ldp d9, d8, [sp, #48]
400ba8: 6d422beb ldp d11, d10, [sp, #32]
400bac: 6d4133ed ldp d13, d12, [sp, #16]
400bb0: fc4407ee ldr d14, [sp], #64
400bb4: d65f03c0 ret
在我看来,那些主循环似乎没有找到任何技巧来避免计算,而且它只是一条直线 8 q-form faddp's vs 16 d-form。
使用 perf 进行基准测试时的结果如下:
Clocks per call
==============================
q d
98631 90285
这并没有完全达到文档建议的收益(q 实际上非常接近文档建议的 8192 * 8 faddp 应该采用的理论 98304 周期,d 必须遇到延迟问题,这并不令人惊讶,因为有0x400b4c 和 0x400b60 之间的依赖关系,它们之间只有 4 条指令)。但是,这些收益似乎暗示着 d-form 有一些优势。