c - How do I randomly access word-aligned data on ARM processors?

Question

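ARM CPUs, at least up to ARMv5, do not allow random access to memory addresses that are not word-aligned. The problem is described in detail here: //lecs.cs.ucla.edu/wiki/index.php/XScale_alignment - one solution is to rewrite your code, or to take this alignment into account in the first place. However, it does not say how. Given a byte stream that contains 2-byte or 4-byte integers which are not word-aligned within the stream, how can I access this data in a smart way without losing too much performance?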

I have a code snippet that illustrates the problem:

#include <stdio.h>
#include <stdlib.h>

#define BUF_LEN 17

int main( int argc, char *argv[] ) {
    unsigned char   buf[BUF_LEN];
    int             i;
    unsigned short  *p_short;
    unsigned long   *p_long;

    /*  fill array  */
    (void) printf( "filling buffer:" );
    for ( i = 0; i < BUF_LEN; i++ ) {
        /* buf[i] = 1 << ( i % 8 ); */
        buf[i] = i;
        (void) printf( " %02hhX", buf[i] );
    }
    (void) printf( "\n" );

    /*  testing with short  */
    (void) printf( "accessing with short:" );
    for ( i = 0; i < BUF_LEN - sizeof(unsigned short); i++ ) {
        p_short = (unsigned short *) &buf[i];
        (void) printf( " %04hX", *p_short );
    }
    (void) printf( "\n" );

    /*  testing with long   */
    (void) printf( "accessing with long:" );
    for ( i = 0; i < BUF_LEN - sizeof(unsigned long); i++ ) {
        p_long = (unsigned long *) &buf[i];
        (void) printf( " %08lX", *p_long );
    }
    (void) printf( "\n" );

    return EXIT_SUCCESS;
}

On an x86 CPU, this is the output:

filling buffer: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10
accessing with short: 0100 0201 0302 0403 0504 0605 0706 0807 0908 0A09 0B0A 0C0B 0D0C 0E0D 0F0E
accessing with long: 03020100 04030201 05040302 06050403 07060504 08070605 09080706 0A090807 0B0A0908 0C0B0A09 0D0C0B0A 0E0D0C0B 0F0E0D0C

On an ATMEL AT91SAM9G20 ARMv5 core, I get (note: this is the expected behavior for this CPU!):

filling buffer: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10
accessing with short: 0100 0100 0302 0302 0504 0504 0706 0706 0908 0908 0B0A 0B0A 0D0C 0D0C 0F0E
accessing with long: 03020100 00030201 01000302 02010003 07060504 04070605 05040706 06050407 0B0A0908 080B0A09 09080B0A 0A09080B 0F0E0D0C

So, given that I want to, or have to, access a byte stream at unaligned addresses: how do I do this efficiently on ARM?

score 2 · Accepted Answer

You write your own pack/unpack functions that convert between aligned variables and the unaligned byte stream. For example,

void unpack_uint32(uint8_t* unaligned_stream, uint32_t* aligned_var)
{
  // copy byte-by-byte from stream to var, you can fill in the details
}
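One way to fill in those details, as a minimal sketch: assemble the value from individual bytes with shifts, so that no multi-byte load ever touches an unaligned address. Little-endian byte order is an assumption here (it matches the output shown in the question). The `const` qualifier and the 16-bit `unpack_uint16` counterpart are additions for illustration and are not part of the answer above.

```c
#include <stdint.h>

/* Assumption: the stream stores integers least-significant byte first. */
void unpack_uint32(const uint8_t *unaligned_stream, uint32_t *aligned_var)
{
    /* Four single-byte loads; each byte is widened before it is shifted. */
    *aligned_var = (uint32_t)unaligned_stream[0]
                 | ((uint32_t)unaligned_stream[1] << 8)
                 | ((uint32_t)unaligned_stream[2] << 16)
                 | ((uint32_t)unaligned_stream[3] << 24);
}

/* Hypothetical 16-bit counterpart, same byte-order assumption. */
void unpack_uint16(const uint8_t *unaligned_stream, uint16_t *aligned_var)
{
    *aligned_var = (uint16_t)(unaligned_stream[0]
                 | ((unsigned)unaligned_stream[1] << 8));
}
```

Called as `unpack_uint32(&buf[1], &value)` on the buffer from the question, this should give the same value on ARM as the x86 run does, because only byte accesses are involved.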

score 1 · Accepted Answer

Your example will demonstrate problems on any platform. The simple fix, of course:

unsigned char   *buf;
int             i;
unsigned short  *p_short;
unsigned long   p_long[BUF_LEN>>2];

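If you cannot organize your data with better alignment (more bytes can sometimes mean better performance), then do the obvious thing: treat everything as 32-bit words and chop out the parts you need from there. The optimizer will help with the shorts and bytes within a word. (In fact, having bytes and shorts in your structures, or picking individual bytes out of memory, may be more costly than passing everything around as words, because of the extra instructions involved. You have to do your system engineering.)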

An example of extracting an unaligned word. (Of course, you have to manage your endianness.)

a = (lptr[offset]<<16)|(lptr[offset+1]>>16);
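The same idea can be written for any byte offset. The sketch below assumes little-endian byte order and that the data already sits in an array of aligned 32-bit words; the name `extract_u32_le` and its parameters are made up for illustration. The shift directions are the mirror image of the one-liner above, which corresponds to big-endian order when `lptr` points to 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a 32-bit value that starts at an arbitrary byte
   offset inside an array of aligned 32-bit words. Assumes little-endian
   byte order. When the offset is not a multiple of 4, words[index + 1] is
   read as well, so the array must extend that far. */
uint32_t extract_u32_le(const uint32_t *words, size_t byte_offset)
{
    size_t   index = byte_offset / 4;                   /* word holding the first byte */
    unsigned shift = (unsigned)(byte_offset % 4) * 8u;  /* bit position of that byte   */

    if (shift == 0)
        return words[index];                            /* already aligned */

    /* The low part comes from the top of the first word, the high part from
       the bottom of the next one. shift is 8, 16 or 24 here, so both shift
       counts stay below 32. */
    return (words[index] >> shift) | (words[index + 1] << (32u - shift));
}
```

Only aligned word loads are issued, at the cost of two loads plus two shifts and an OR for every unaligned value.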

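All ARM cores from ARMv4 up to the present allow unaligned accesses. Most have the exception turned on by default, but you can turn it off. The older ones rotate within the word but, if I remember correctly, others can grab the other byte lanes.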

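Do your system engineering, do your performance analysis, and determine whether moving everything around as words is faster or slower. The actual moving of the data will have some overhead, but the code on both sides will run faster if everything is aligned. Can you live with data movement that is some X times slower in order to make the generation and reception of that data 2 to 4 times faster?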
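One way to read that trade-off: pay once for copying the payload into word-aligned storage, after which every access is an ordinary aligned load. The sketch below uses `memcpy` for the copy; `PAYLOAD_WORDS` and `sum_words_from_stream` are hypothetical names, and summing the words merely stands in for whatever processing follows. The values come out in the host's byte order. Whether this beats decoding fields in place is something to measure on the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_WORDS 4   /* hypothetical payload size, in 32-bit words */

/* Copy PAYLOAD_WORDS * 4 bytes from an arbitrarily aligned position in the
   stream into an aligned array, then work on the aligned copy. */
uint32_t sum_words_from_stream(const uint8_t *stream)
{
    uint32_t aligned[PAYLOAD_WORDS];  /* the compiler aligns this for uint32_t */
    uint32_t sum = 0;
    size_t   i;

    memcpy(aligned, stream, sizeof aligned);  /* the one-time data movement */

    for (i = 0; i < PAYLOAD_WORDS; i++)
        sum += aligned[i];                    /* plain aligned word accesses */

    return sum;
}
```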

score 0 · Accepted Answer

This function always uses aligned 32-bit accesses:

#include <stdint.h>

/* Assumes a little-endian target. Reads the two aligned words that straddle
   the value, so it may touch up to 3 bytes before and after the 4 requested. */
uint32_t fetch_unaligned_uint32 (uint8_t *unaligned_stream)
{
    switch (((uintptr_t)unaligned_stream) & 3u)
    {
        case 3u:
            return ((*(uint32_t *)&unaligned_stream[-3]) >> 24)
                 | ((*(uint32_t *)&unaligned_stream[ 1]) <<  8);
        case 2u:
            return ((*(uint32_t *)&unaligned_stream[-2]) >> 16)
                 | ((*(uint32_t *)&unaligned_stream[ 2]) << 16);
        case 1u:
            return ((*(uint32_t *)&unaligned_stream[-1]) >>  8)
                 | ((*(uint32_t *)&unaligned_stream[ 3]) << 24);
        case 0u:
        default:
            return *(uint32_t *)unaligned_stream;
    }
}

It may be faster than reading and shifting all 4 bytes separately.
