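Some time ago I read somewhere that SSE intrinsics compile into efficient machine code because compilers treat them differently from ordinary functions. I am wondering how compilers actually do this, and what a C programmer can do to help the process along. Are there any guidelines on how to use intrinsics in a way that makes it easier for the compiler to generate efficient machine code?
Thanks.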
Contrary to what Necrolis wrote, the intrinsics may or may not compile down to the instructions they represent. This is especially true for copy or load instructions such as _mm_load_pd, since the compiler is still responsible for register allocation and assignment when using intrinsics. This means that copying a value from one location to another may not be necessary at all, if the two locations can be represented by the same register. In that case the compiler may choose to remove the copy. It may also choose to remove other instructions if the result is never used.
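To make that concrete, here is a minimal sketch in C, assuming an x86 target with SSE2. The function names add_pairs and add_pairs_total are made up for illustration; only the `_mm_*` intrinsics come from `<emmintrin.h>`. The source contains a redundant copy and an unused multiply, both of which an optimizing compiler is free to drop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd, ... */

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0, 1.
   All three pointers must be 16-byte aligned, as _mm_load_pd and
   _mm_store_pd require. */
void add_pairs(const double *a, const double *b, double *out)
{
    __m128d va = _mm_load_pd(a);
    __m128d vb = _mm_load_pd(b);

    /* A plain copy: the compiler can keep `copy` in the same register
       as `va`, so it does not have to emit a move for this line. */
    __m128d copy = va;

    /* The product is never used, so the optimizer may remove the multiply. */
    __m128d unused = _mm_mul_pd(va, vb);
    (void)unused;

    _mm_store_pd(out, _mm_add_pd(copy, vb));
}

/* Hypothetical wrapper: returns (a0 + b0) + (a1 + b1).
   The local arrays exist only to give add_pairs aligned memory; once
   add_pairs is inlined, the compiler may keep the values in registers
   and skip some of the loads and stores written here. */
double add_pairs_total(double a0, double a1, double b0, double b1)
{
    _Alignas(16) double a[2] = { a0, a1 };
    _Alignas(16) double b[2] = { b0, b1 };
    _Alignas(16) double out[2];

    add_pairs(a, b, out);
    return out[0] + out[1];
}
```

To see what your compiler actually emits, compile with optimization and look at the assembly (for example `gcc -O2 -S`). The exact output depends on the compiler and version, so treat the comments above as what is allowed, not what is guaranteed.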
Check out this blog post where the behavior of different compilers is compared in practice. It's from 2009, so the details may no longer apply. However, newer compilers are likely to optimize your code more, not less.
As for actually using intrinsics efficiently, the answer is the same as for all other performance optimization: measure, measure, and measure. Make sure that you are actually dealing with a hot piece of code, find out why it's slow, and then improve it. You are very likely to find that improving your memory access patterns is more important than using intrinsics.
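As a rough illustration of measuring first, the sketch below times two traversals of the same array that differ only in memory access order. Everything here is an assumption for the sake of the example: the names (grid, fill_grid, sum_row_major, sum_col_major, time_call), the 1024 x 1024 size, and the use of clock() from the C standard library, which is coarse but portable. A real investigation would use a profiler on your actual hot code.

```c
#include <stddef.h>
#include <time.h>

enum { ROWS = 1024, COLS = 1024 };   /* illustrative size only */

static float grid[ROWS][COLS];
static volatile double sink;         /* keeps the timed result from being optimized away */

/* Sets every element to 1.0f and returns the number of elements written. */
size_t fill_grid(void)
{
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            grid[r][c] = 1.0f;
    return (size_t)ROWS * COLS;
}

/* Walks memory in the order it is laid out (consecutive addresses). */
double sum_row_major(void)
{
    double s = 0.0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            s += grid[r][c];
    return s;
}

/* Same sum, but each step jumps a whole row ahead in memory. */
double sum_col_major(void)
{
    double s = 0.0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            s += grid[r][c];
    return s;
}

/* Returns the CPU time, in seconds, that one call to fn took. */
double time_call(double (*fn)(void))
{
    clock_t start = clock();
    sink = fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call fill_grid() once, then compare time_call(sum_row_major) with time_call(sum_col_major), ideally over several runs. The numbers depend on your machine and compiler flags, which is exactly why you measure instead of guessing.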