assembly - Loading a 16-bit (or larger) immediate with Arm GCC inline assembly

Question

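Note: The examples here are simplified for brevity, so they do not justify what I am doing. If I were only writing to a memory location as in the example, C would be the best approach. However, for what I am actually doing I cannot use C, even though staying in C is generally preferable.

I am trying to load a register with a value, but I am stuck with 8-bit immediates.

My code: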

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,       %[gpio_out_addr_high]    \n\t"
        "lsl ip,       ip,                   #8 \n\t"
        "add ip,       %[gpio_out_addr_low]     \n\t"
        "lsl ip,       ip,                   #2 \n\t"
        "str %[value], [ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,       %[gpio_out_addr]    \n\t"
//         "str %[value], [ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}
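
The two "I" operands in b() split the word-aligned address into two 8-bit pieces, and the mov / lsl / add / lsl sequence puts them back together. A minimal sketch of that arithmetic, checked at compile time (the names kAddr, kLow, kHigh and rebuild are illustrative and not part of the original code):

```cpp
#include <cstdint>

// Address used in the question; its two lowest bits are zero (word aligned).
constexpr uint32_t kAddr = 0x21014;

// The two 8-bit pieces that b() passes through "I" constraints.
constexpr uint32_t kLow  = (kAddr >> 2) & 0xff;        // 0x05
constexpr uint32_t kHigh = (kAddr >> (2 + 8)) & 0xff;  // 0x84

// Mirrors the instruction sequence in b():
//   mov ip, high ; lsl ip, ip, #8 ; add ip, low ; lsl ip, ip, #2
constexpr uint32_t rebuild(uint32_t high, uint32_t low) {
    return ((high << 8) + low) << 2;
}

static_assert((kAddr & 0x3) == 0, "address must be word aligned for this trick");
static_assert(rebuild(kHigh, kLow) == kAddr, "the two pieces must restore the address");
```

This only works because the address fits in 18 bits with the low two bits clear; a full 32-bit address would need more pieces.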

When I write it in C (see a()), Godbolt shows it compiled to:

a(unsigned char):
        mov     r3, #135168
        str     r0, [r3, #20]
        bx      lr

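I think it uses MOV as a pseudo-instruction. When I want to do the same in assembly, I could put the value in some memory location and load it with LDR. I think that is how the C code gets compiled when I use -march=ARMv7E-M (the MOV is replaced with an LDR), but in many cases this is not practical for me because I will be doing other things.

For the address 0x21014, the two lowest bits are zero, so by shifting it appropriately I can treat this 18-bit number as a 16-bit one, which is what I do in b(). But I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates: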

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, both the ARM and Thumb instruction sets include:

A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

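I think my Cortex-M4 should be ARMv7E-M, which should satisfy this "ARMv6T2 and later" requirement, so it should be able to use 16-bit immediates.

However, I see no mention of this in the GCC inline assembly documentation: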

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

When I enable the ARMv7E-M architecture and uncomment c(), where I use a regular "I" immediate, I get a compile error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'

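So I am wondering whether there is a way to use 16-bit immediates in GCC inline assembly, or whether I am missing something (which would make my question irrelevant)?

Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the real disassembled machine code, so that I know exactly which instructions these pseudo/macro assembly instructions produce.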

score 3 · Accepted Answer

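@Jester suggested in the comments either using the i constraint to pass larger immediates, or using a real C variable, initializing it with the desired value and letting the inline assembly take it. This sounds like the best solution: the less time spent in inline assembly, the better. People who want better performance often underestimate how well C/C++ toolchains can optimize when given the right code, and for many, rewriting the C/C++ code is the answer rather than redoing everything in assembly. @Peter Cordes mentioned not using inline assembly at all, and I agree. In this case, however, the exact timing of some instructions is critical, and I cannot risk the toolchain optimizing the timing of some instructions slightly differently.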
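
As a rough illustration of the i constraint idea, here is a minimal sketch that builds a 32-bit constant with a MOVW/MOVT pair. The helper names (lowHalf, highHalf, loadConstant, gpioOdrAddress) are hypothetical and not from the original code. It assumes GCC targeting Thumb-2 (detected through the __thumb2__ macro) and that an "i" operand is printed as a #-prefixed immediate; on any other target it falls back to plain C++ so the snippet still compiles.

```cpp
#include <cstdint>

// Lower and upper 16-bit halves of a 32-bit constant.
constexpr uint32_t lowHalf(uint32_t v)  { return v & 0xffffu; }
constexpr uint32_t highHalf(uint32_t v) { return v >> 16; }

// Hypothetical helper: materializes the compile-time constant Value in a register.
template <uint32_t Value>
inline uint32_t loadConstant() {
    constexpr uint32_t lo = lowHalf(Value);
    constexpr uint32_t hi = highHalf(Value);
#if defined(__thumb2__)
    uint32_t reg;
    // "i" accepts any compile-time integer constant, unlike "I", which is
    // limited to what a data-processing instruction can encode.
    asm("movw %[reg], %[lo] \n\t"   // reg = lo (upper half cleared)
        "movt %[reg], %[hi] \n\t"   // upper half of reg = hi, lower half kept
        : [reg] "=r"(reg)
        : [lo] "i"(lo), [hi] "i"(hi));
    return reg;
#else
    // Assumption: host build without Thumb-2, so just combine the halves.
    return (hi << 16) | lo;
#endif
}

// Example: the ODR address used later in this answer.
uint32_t gpioOdrAddress() { return loadConstant<0x40021014u>(); }
```

The other option from the comments, passing the value through an "r" operand from an ordinary C variable, leaves the choice of MOVW/MOVT versus a literal-pool LDR to the compiler.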

Bit-banging a protocol is not ideal, and in most cases the answer is to avoid bit-banging, but in my case it is not that simple and the other approaches did not work:

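SPI cannot be used to stream the data, because I need to drive more signals and use arbitrary lengths, while my hardware supports only 8-bit/16-bit transfers.
Tried DMA2GPIO and ran into jitter problems.
Tried an IRQ handler; its overhead is too large and my performance dropped (as shown below, there are only 2 nops, so there is not much idle time left to do anything else).
Tried a pre-baked bit stream (including timing), but for 1 byte of real data I ended up storing 64 bytes of stream data, and reading it all from memory was much slower overall.
Pre-baked functions for each written value (with a function lookup table indexed by the written value) worked very well, in fact too fast: the toolchain now had compile-time known values and could optimize them well, so my TCK was above 40MHz. The problem was that I had to add a lot of delays to slow it down to the desired speed (8MHz), and this had to be done for every input value. That is fine when the length is 8 bits or shorter, but for a 32-bit length it cannot fit into flash (2^32 => 4294967296), and splitting a single 32-bit access into four 8-bit accesses introduced a lot of jitter on the TCK signal.
Implementing this peripheral in FPGA fabric would give me control over everything, and usually that is the right answer, but I wanted to try implementing it on a device without fabric.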

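Long story short, bit-banging is bad and in most cases there are better ways to solve the problem, and unnecessary use of inline assembly may actually produce worse results without you noticing, but in my case I needed it. In my earlier code I tried to focus on one simple question about immediates, rather than getting into tangents or an XY problem discussion.

Now back to the topic of "passing larger immediates to assembly"; here is an implementation of a more realistic example: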

https://godbolt.org/z/5vbb7PPP5

#include <cstdint>

const uint8_t TCK = 2;
const uint8_t TMS = 3;
const uint8_t TDI = 4;
const uint8_t TDO = 5;

template<uint8_t number>
constexpr uint8_t powerOfTwo() {
    static_assert(number <8, "Output would overflow, the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
    int ret = 1;
    for (int i=0; i<number; i++) {
        ret *= 2;
    }
    return ret;
}

template<uint8_t WHAT_SIGNAL>
__attribute__((optimize("-Ofast")))
uint32_t shiftAsm(const uint32_t length, uint32_t write_value) {
    uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
    uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)

    uint32_t count     = 0;
    uint32_t shift_out = 0;
    uint32_t shift_in  = 0;
    uint32_t ret_value = 0;

    asm volatile (
    "cpsid if                                                  \n\t"  // Disable IRQ
    "repeatForEachBit%=:                                       \n\t"

    // Low part of the TCK
    "and.w %[shift_out],   %[write_value],    #1               \n\t"  // shift_out = write_value & 1
    "lsls  %[shift_out],   %[shift_out],      %[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out

    // On the first cycle this is redundant, as it processed the shift_in from the previous iteration.
    // First iteration is safe to do extraneously as it's just doing zeros
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Prepare things that are needed toward the end of the loop, but can be done now
    "orr.w %[shift_out],   %[shift_out],      %[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
    "lsr   %[write_value], %[write_value],    #1               \n\t"  // write_value = write_value >> 1
    "adds  %[count],       #1                                  \n\t"  // count++
    "cmp   %[count],       %[length]                           \n\t"  // if (count != length) then ....

    // High part of the TCK + sample
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    "nop                                                       \n\t"
    "nop                                                       \n\t"
    "ldr   %[shift_in],    [%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
    "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit

    "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished

    // Process the shift_in as normally it's done in the next iteration of the loop
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Outputs
    : [ret_value]       "+r"(ret_value),
      [count]           "+r"(count),
      [shift_out]       "+r"(shift_out),
      [shift_in]        "+r"(shift_in)

    // Inputs
    : [gpio_out_addr]   "r"(addressWrite),
      [gpio_in_addr]    "r"(addressRead),
      [length]          "r"(length),
      [write_value]     "r"(write_value),
      [write_shift]     "M"(WHAT_SIGNAL),
      [read_shift]      "M"(TDO),
      [clock_mask]      "I"(powerOfTwo<TCK>())

    // Clobbers
    : "memory"
    );

    return ret_value;
}

int main() {
    shiftAsm<TMS>(7,  0xff);                  // reset the target TAP controller
    shiftAsm<TMS>(3,  0x12);                  // go to some arbitrary TAP state
    shiftAsm<TDI>(32, 0xdeadbeef);            // write to target

    auto ret = shiftAsm<TDI>(16, 0x0000);     // read from the target

    return 0;
}
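
To make it easier to see what the assembly loop computes, here is a host-side sketch in plain C++ that models only the data flow, not the timing. The names shiftModel and loopbackDemo, and the two callbacks standing in for the ODR/IDR registers, are assumptions made for illustration. Like the assembly, it sends write_value LSB first and shifts the sampled TDO bits into the result from the right, so the first sampled bit ends up in the highest position.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t TCK = 2;
constexpr uint8_t TDI = 4;
constexpr uint8_t TDO = 5;

// Model of the shift loop. writeOdr/readIdr are hypothetical stand-ins
// for the GPIO output and input data registers.
template <typename WriteFn, typename ReadFn>
uint32_t shiftModel(uint32_t length, uint32_t writeValue, uint8_t signalPin,
                    WriteFn writeOdr, ReadFn readIdr) {
    // The assembly loop tests its counter at the end, so it needs length >= 1.
    assert(length >= 1 && length <= 32);
    uint32_t ret = 0;
    for (uint32_t count = 0; count < length; ++count) {
        const uint32_t out = (writeValue & 1u) << signalPin;
        writeOdr(out);                          // low part of TCK, data presented
        writeValue >>= 1;
        writeOdr(out | (1u << TCK));            // high part of TCK
        const uint32_t in = readIdr();          // sample the input register
        ret = (ret << 1) | ((in >> TDO) & 1u);  // append the sampled TDO bit
    }
    return ret;
}

// Hypothetical loopback target: TDO mirrors whatever was last driven on TDI.
uint32_t loopbackDemo(uint32_t length, uint32_t value) {
    uint32_t odr = 0;
    auto write = [&odr](uint32_t v) { odr = v; };
    auto read  = [&odr]() -> uint32_t { return ((odr >> TDI) & 1u) << TDO; };
    return shiftModel(length, value, TDI, write, read);
}
```

With the loopback target, loopbackDemo(8, 0x0D) returns 0xB0: the bits leave LSB first and are collected MSB first, so the value comes back bit-reversed within the given length.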

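@David Wohlferd's comment about reducing the amount of assembly gives the toolchain more opportunity to further optimize the "load the address into a register" part: when the function is inlined, it should not load the addresses again (so they are loaded only once across multiple read/write calls). Here it is with inlining enabled: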

https://godbolt.org/z/K8GYYqrbq

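The question is, was it worth it? I think so: my TCK is dead-on 8MHz, my duty cycle is close to 50%, and I am more confident that the duty cycle will stay that way. The sampling also happens when I expect it to, without worrying that it will be optimized differently under different toolchain settings.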
