I'm not exactly sure what happens when I call _mm_load_ps. I know it loads an array of 4 floats into a __m128, which I can use to do SIMD-accelerated arithmetic and then store the results back, but isn't this __m128 data type still on the stack? Obviously there aren't enough registers for an arbitrary number of vectors to be loaded. So are these 128 bits of data moved back and forth each time I use a SIMD instruction to do a computation? If so, then what is the point of _mm_load_ps?
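To make the question concrete, here is a minimal sketch of the load/compute/store pattern I mean. The helper name add4 is made up for illustration, and it assumes all three pointers are 16-byte aligned, which _mm_load_ps and _mm_store_ps require.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
 * Assumes a, b and out are all 16-byte aligned. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va   = _mm_load_ps(a);      /* aligned load of 4 floats */
    __m128 vb   = _mm_load_ps(b);      /* aligned load of 4 floats */
    __m128 vsum = _mm_add_ps(va, vb);  /* 4 additions in one instruction */
    _mm_store_ps(out, vsum);           /* aligned store of 4 floats */
}
```

What I can't tell is whether va, vb and vsum here live in XMM registers or in memory on the stack.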
Maybe I have it all wrong?