I need to load values from a uint8 array into a 128-bit NEON register. There is a similar question, but it has no good answers.

My solution is:

uint8_t arr[8] = {1,2,3,4}; //padded to 8 bytes: vld1_u8 always reads 8

//load 8 of 8-bit vals into 64-bit reg (only the first 4 are used)
uint8x8_t _vld1_u8 = vld1_u8(arr);

//convert to 16-bit and move to 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);

//get low 64 bit and move them to 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);

//convert to 32-bit and move to 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);

This works fine, but it seems to me that this approach is not the fastest. Is there a better and faster way to load 8-bit data into a 128-bit register as 32-bit values?
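For reference, here is the same widening chain wrapped in a small helper. This is only a sketch: the name widen4_u8_to_u32 is made up for illustration, the lane order assumes a little-endian target, and the scalar branch is there only so it builds on machines without NEON. It reads exactly four bytes through memcpy, because vld1_u8 always reads eight.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the question): widen exactly four
 * bytes at src into four 32-bit values stored at dst. */
static void widen4_u8_to_u32(const uint8_t *src, uint32_t dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);            /* reads only 4 bytes */
    /* Place the 4 bytes in the low half of a 64-bit register
     * (assumes a little-endian target). */
    uint8x8_t  b8  = vreinterpret_u8_u32(vdup_n_u32(packed));
    uint16x8_t w16 = vmovl_u8(b8);                  /* 8 x u8 -> 8 x u16 */
    uint32x4_t w32 = vmovl_u16(vget_low_u16(w16));  /* low 4 x u16 -> 4 x u32 */
    vst1q_u32(dst, w32);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
#endif
}
```

Calling it with a 4-byte input and a uint32_t out[4] buffer fills out with the zero-extended values. Whether the memcpy-based load is faster or slower than vld1_u8 on a padded buffer depends on the compiler and core, so it is worth measuring.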

Edit:

Thanks to @FrankH, I came up with a second version using a hack:

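//q_zero (not shown originally) is assumed to be an all-zero vector
uint8x16_t q_zero = vdupq_n_u8(0);
//note: vld1q_u8 reads 16 bytes, so arr must have 16 readable bytes
uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z; //hack: take the low register via a cast
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint32x4_t*)&z; //low register now holds four 32-bit values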

It boils down to this assembly (when compiler optimisations are on):

vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr   q9, q4, q4
vzip.8 q8, q9

So there are no redundant stores and it's pretty fast. But it is still about 1.5x slower than the first solution.


1 Answer


You can do a "double zip" with zeros:

uint16x4_t zero = 0;

uint32x4_t ld32x4 =
    vreinterpretq_u32_u16(
        vzipq_u8(
            vzip_u8(
                vld1_u8(arr),
                vreinterpret_u8_u16(zero)
            ),
            zero
        )
    );

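Since the vreinterpretq_*() calls are no-ops, this should boil down to three instructions. I don't have a cross compiler at hand right now, so I can't verify it :(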
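As written, the nested expression does not type-check: vzip_u8() returns a uint8x8x2_t, while vzipq_u8() takes two uint8x16_t arguments and vreinterpretq_u32_u16() expects a uint16x8_t. Below is one way the double-zip idea might be spelled so that it compiles. Treat it as a sketch under assumptions: the helper name zip_widen4_u8_to_u32 is made up, the lane order assumes a little-endian target, .val[...] is used only to make the types line up, and the scalar branch is a model of the same byte shuffling for builds without NEON.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: zero-extend four bytes to 32 bits by
 * interleaving ("zipping") with zero twice. */
static void zip_widen4_u8_to_u32(const uint8_t *src, uint32_t dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);            /* reads only 4 bytes */
    uint8x8_t b8 = vreinterpret_u8_u32(vdup_n_u32(packed));
    /* First zip: val[0] = b0,0,b1,0,b2,0,b3,0 (four 16-bit values). */
    uint8x8x2_t z8 = vzip_u8(b8, vdup_n_u8(0));
    uint16x4_t h = vreinterpret_u16_u8(z8.val[0]);
    /* Second zip: val[0] = h0,0,h1,0 and val[1] = h2,0,h3,0. */
    uint16x4x2_t z16 = vzip_u16(h, vdup_n_u16(0));
    uint32x4_t w32 =
        vreinterpretq_u32_u16(vcombine_u16(z16.val[0], z16.val[1]));
    vst1q_u32(dst, w32);
#else
    /* Scalar model of the same two interleaves. */
    uint8_t s1[8], s2[16];
    for (int i = 0; i < 4; ++i) {       /* first zip with zero bytes */
        s1[2 * i]     = src[i];
        s1[2 * i + 1] = 0;
    }
    for (int i = 0; i < 4; ++i) {       /* second zip with zero halfwords */
        s2[4 * i]     = s1[2 * i];
        s2[4 * i + 1] = s1[2 * i + 1];
        s2[4 * i + 2] = 0;
        s2[4 * i + 3] = 0;
    }
    for (int i = 0; i < 4; ++i)         /* read back as little-endian lanes */
        dst[i] = (uint32_t)s2[4 * i]
               | (uint32_t)s2[4 * i + 1] << 8
               | (uint32_t)s2[4 * i + 2] << 16
               | (uint32_t)s2[4 * i + 3] << 24;
#endif
}
```

How many instructions this turns into, and whether the .val[...] accesses cost extra moves, depends on the compiler and its version, so check the generated assembly before relying on it.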

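Edit: Don't get me wrong here... although vreinterpretq_*() produces no NEON instruction, it is not a no-op. It prevents the compiler from doing the odd things it would do if you used widerVal.val[0] instead. All it tells the compiler is something like:

"You have a uint8x16x2_t, but I only want to use half of it as a uint8x16_t, so give me half of the registers."

Or:

"You have a uint8x16x2_t, but I want to use those registers as a uint32x4_t instead."

That is, it tells the compiler to alias sets of NEON registers, which prevents the stores to and loads from the stack that you get when you do explicit subset access through the .val[...] syntax.

In a way, the .val[...] syntax "is a hack", while the better method, using vreinterpretq_*(), "looks like a hack". Not using it leads to more instructions and slower/worse code.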

answered 2013-07-23T13:04:51.487