assembly - How to generate the machine code of Thumb instructions?

Question

I searched Google for how to generate the machine code of ARM instructions and found questions such as "Converting very simple ARM instructions to binary/hex".

The answer there referenced the ARM7TDMI-S Data Sheet (ARM DDI 0084D). Its diagram of the data processing instructions is good enough. Unfortunately, it covers ARM instructions, not Thumb/Thumb-2 instructions.

Take the B instruction as an example. ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition section A8.8.18, Encoding T4:

For the assembly code:

B 0x50

How can I encode the immediate value 0x50 into the 4-byte machine code? Or, if I want to write a C function that takes the B instruction and the relative address as inputs and returns the encoded machine code, how can I implement such a function?

unsigned int gen_mach_code(int instruction, int relative_addr)
{
    /* the int instruction parameter is assumed to be B */
    /* encoding method is assumed to be T4 */
    unsigned int mach_code;
    /* construct the machine code of B<c>.W <label> */
    return mach_code;
}

I know how immediate values are encoded on ARM; http://alisdair.mcdiarmid.org/arm-immediate-value-encoding/ is a good tutorial.

I just want to know where imm10 and imm11 come from, and how to construct the full machine code from them.

score 3 · Accepted Answer

First off, the ARM7TDMI does not support the Thumb-2 extensions; it basically defines the original Thumb instruction set.

So why not just try it?

.thumb
@.syntax unified

b 0x50

Run these commands

arm-whatever-whatever-as b.s -o b.o
arm-whatever-whatever-objdump -D b.o

and you get this output

0:  e7fe        b.n 50 <*ABS*0x50>

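So this is a T2 encoding, which the newer documents show as supported on ARMv4T, ARMv5T*, ARMv6* and ARMv7. The ARM7TDMI is ARMv4T.

We see that E7 matches the 11100 that starts the definition of this instruction, so imm11 is 0x7FE. That is basically the encoding of a branch to address 0x000, because it has not been linked against anything. How do I know?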

.thumb
b skip
nop
nop
nop
nop
nop
skip:

00000000 <skip-0xc>:
   0:   e004        b.n c <skip>
   2:   46c0        nop         ; (mov r8, r8)
   4:   46c0        nop         ; (mov r8, r8)
   6:   46c0        nop         ; (mov r8, r8)
   8:   46c0        nop         ; (mov r8, r8)
   a:   46c0        nop         ; (mov r8, r8)

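0xe004 starts with 11100, so this is branch encoding T2. imm11 is 4.

We need to get from 0 to 0xC. The pc is two instructions ahead when the offset is applied. The documentation says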

Encoding T2 Even numbers in the range –2048 to 2046

and

PC, the program counter
- When executing an ARM instruction, PC reads as the address of the current instruction plus 8.
- When executing a Thumb instruction, PC reads as the address of the current instruction plus 4.

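So it all makes sense. 0xC-0x4 = 8. We can only encode even numbers, and branching into the middle of an instruction makes no sense anyway, so divide by 2, because Thumb instructions are two bytes (the offset is in instructions, not bytes). That gives 4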

0xE004
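The T2 arithmetic above can be written down as a small C routine. This is only a sketch: the name encode_b_t2 and its parameters are made up for illustration, and it assumes the caller passes the address the branch will sit at and the address it should reach.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode "B <label>" with Thumb encoding T2.
 * insn_addr and target are byte addresses; bit 0 of each is ignored. */
static uint16_t encode_b_t2(uint32_t insn_addr, uint32_t target)
{
    /* In Thumb state the PC reads as the instruction address plus 4. */
    int32_t offset = (int32_t)((target & ~1u) - ((insn_addr & ~1u) + 4u));

    /* T2 reaches even offsets from -2048 to 2046 bytes. */
    assert(offset >= -2048 && offset <= 2046);

    /* imm11 counts halfwords, so drop the always-zero low bit. */
    uint16_t imm11 = (uint16_t)(((uint32_t)offset >> 1) & 0x7FFu);

    return (uint16_t)(0xE000u | imm11);   /* 11100 : imm11 */
}
```

With the branch at 0 and skip at 0xC, encode_b_t2(0x0, 0xC) gives 0xE004, matching the disassembly. A branch whose target is its own address gives 0xE7FE.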

Here is one way to get the T4 encoding generated

.thumb
.syntax unified

b skip
nop
nop
nop
nop
nop
skip:

00000000 <skip-0xe>:
   0:   f000 b805   b.w e <skip>
   4:   46c0        nop         ; (mov r8, r8)
   6:   46c0        nop         ; (mov r8, r8)
   8:   46c0        nop         ; (mov r8, r8)
   a:   46c0        nop         ; (mov r8, r8)
   c:   46c0        nop         ; (mov r8, r8)

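The T4 encoding of the branch has 11110 at the top of the first halfword, which means this is either an undefined instruction (on anything that is not ARMv6T2 or ARMv7) or a Thumb-2 extension on ARMv6T2 and ARMv7.

The second halfword is 10x1, and we see a B there, so it looks good: this is a Thumb-2 extension branch.

S is 0, imm10 is 0, J1 is 1, J2 is 1 and imm11 is 5.

I1 = NOT(J1 EOR S); I2 = NOT(J2 EOR S); imm32 = SignExtend(S:I1:I2:imm10:imm11:'0', 32);

1 EOR 0 is 1, right? NOT that and you get 0. So I1 and I2 are both zero, S is zero and imm10 is zero. So for this one we are basically just looking at imm11 as a positive number.

The pc is four bytes ahead when this executes, so 0xE - 0x4 = 0xA.

0xA / 2 = 0x5, and that is our branch offset: pc + (5*2)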

.syntax unified
.thumb


b.w skip
nop
here:
nop
nop
nop
nop
skip:
b.w here

00000000 <here-0x6>:
   0:   f000 b805   b.w e <skip>
   4:   46c0        nop         ; (mov r8, r8)

00000006 <here>:
   6:   46c0        nop         ; (mov r8, r8)
   8:   46c0        nop         ; (mov r8, r8)
   a:   46c0        nop         ; (mov r8, r8)
   c:   46c0        nop         ; (mov r8, r8)

0000000e <skip>:
   e:   f7ff bffa   b.w 6 <here>

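S is 1, imm10 is 0x3FF, J1 is 1, J2 is 1 and imm11 is 0x7FA.

1 EOR 1 is 0; NOT that and you get 1 for I1, and the same for I2.

imm32 = SignExtend(S:I1:I2:imm10:imm11:'0', 32);

S is 1, so this sign-extends with ones, and everything except the last few bits is 1. S:I1:I2:imm10:imm11 is 0xFFFFFFFA, or -6 halfwords, and with the trailing '0' appended imm32 is 0xFFFFFFF4, or -12 bytes.

So our offset checks out too: ((0xE + 4) - 6)/2 = 6 halfwords back. Or, looking at it the other way, from the instruction encoding: PC - (6*2) = (0xE + 4) - 12 = 6, a branch to 0x6.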
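The two T4 walk-throughs above (f000 b805 and f7ff bffa) follow the same steps, which can be sketched in C as follows. The function names are hypothetical, and the halfwords are assumed to be passed in the order objdump prints them.

```c
#include <stdint.h>

/* Hypothetical helper: recover imm32 (a byte offset) from the two
 * halfwords of a B.W instruction, encoding T4. */
static int32_t decode_b_t4(uint16_t hw1, uint16_t hw2)
{
    uint32_t s     = (hw1 >> 10) & 1u;   /* first halfword:  11110 S imm10 */
    uint32_t imm10 = hw1 & 0x3FFu;
    uint32_t j1    = (hw2 >> 13) & 1u;   /* second halfword: 10 J1 1 J2 imm11 */
    uint32_t j2    = (hw2 >> 11) & 1u;
    uint32_t imm11 = hw2 & 0x7FFu;

    uint32_t i1 = (j1 ^ s) ^ 1u;         /* I1 = NOT(J1 EOR S) */
    uint32_t i2 = (j2 ^ s) ^ 1u;         /* I2 = NOT(J2 EOR S) */

    /* S:I1:I2:imm10:imm11:'0' is a 25-bit value with S in bit 24. */
    uint32_t imm = (s << 24) | (i1 << 23) | (i2 << 22)
                 | (imm10 << 12) | (imm11 << 1);

    /* Sign-extend from bit 24 without relying on signed shifts. */
    return (int32_t)(imm ^ 0x1000000u) - 0x1000000;
}

/* Branch target: the PC reads as the instruction address plus 4. */
static uint32_t b_t4_target(uint32_t insn_addr, uint16_t hw1, uint16_t hw2)
{
    return insn_addr + 4u + (uint32_t)decode_b_t4(hw1, hw2);
}
```

decode_b_t4(0xF000, 0xB805) yields 10 and decode_b_t4(0xF7FF, 0xBFFA) yields -12, so b_t4_target lands on 0xE and 0x6 for the two branches in the listing.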

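So if you want to branch to 0x70 and the address of the instruction is 0x12, then your offset is 0x70-(0x12+4) = 0x5A bytes, or 0x2D halfwords. We know from the skip example that the trick is to make S 0 and J1 and J2 both 1.

0x12: 0xF000 0xB82D  branch to 0x70

So now, knowing all that, we can go back to this: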

0:  e7fe        b.n 50 <*ABS*0x50>

The offset is 0x7FE sign-extended, or 0xFFFFFFFE. 0xFFFFFFFE*2 + 4 = 0xFFFFFFFC + 4 = 0x00000000. A branch to 0.

Add a nop

.thumb
nop
b 0x50

00000000 <.text>:
   0:   46c0        nop         ; (mov r8, r8)
   2:   e7fe        b.n 50 <*ABS*0x50>

Same encoding
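Decoding e7fe at both addresses shows why the encoding did not change. The sketch below uses a hypothetical helper, b_t2_target, that applies the T2 rules from earlier in reverse.

```c
#include <stdint.h>

/* Hypothetical helper: compute where a T2 "B" at insn_addr goes. */
static uint32_t b_t2_target(uint32_t insn_addr, uint16_t insn)
{
    uint32_t imm11 = insn & 0x7FFu;

    /* Sign-extend the 11-bit halfword count, then scale it to bytes. */
    int32_t halfwords = (int32_t)(imm11 ^ 0x400u) - 0x400;

    /* The PC reads as the instruction address plus 4. */
    return insn_addr + 4u + (uint32_t)(halfwords * 2);
}
```

b_t2_target(0x0, 0xE7FE) is 0 and b_t2_target(0x2, 0xE7FE) is 2. Wherever it sits, 0xE7FE branches to itself, which is consistent with the offset being left unresolved in the object file rather than encoding 0x50.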

So the disassembly implies an absolute 0x50, but that is not what got encoded. Linking does not help either, it just complains

(.text+0x0): relocation truncated to fit: R_ARM_THM_JUMP11 against `*ABS*0x50'

This

.thumb
nop
b 0x51

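gives the same encoding.

So basically there is something wrong with this syntax, and/or it is looking for a label named 0x50?

I hope the point of your example was that you want to know the encoding of a branch to some address, rather than that exact syntax.

ARM is not like some other instruction sets: branches are always relative. So if the destination is within reach of the encoding, you get a branch; otherwise you have to use bx or pop or one of the other ways of modifying the pc (with an absolute value).

Knowing from the documentation that the T2 encoding can only reach roughly 2048 bytes ahead, put more than 2048 nops between the branch and its destination and you get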

b.s: Assembler messages:
b.s:5: Error: branch out of range
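The reach of each encoding can be checked up front. The sketch below is illustrative only: the enum and function names are invented, and the T4 limits are the ones given in the ARMv7-A/R manual for encoding T4 (even numbers from -16777216 to 16777214). On an ARMv4T core such as the ARM7TDMI only the narrow form exists.

```c
#include <stdint.h>

/* Hypothetical classification of an unconditional Thumb branch. */
enum b_form { B_T2_NARROW, B_T4_WIDE, B_OUT_OF_RANGE };

static enum b_form pick_b_encoding(uint32_t insn_addr, uint32_t target)
{
    /* Offset relative to the PC, which reads as the address plus 4. */
    int64_t offset = (int64_t)(target & ~1u)
                   - ((int64_t)(insn_addr & ~1u) + 4);

    if (offset >= -2048 && offset <= 2046)
        return B_T2_NARROW;      /* 16-bit encoding T2 */
    if (offset >= -16777216 && offset <= 16777214)
        return B_T4_WIDE;        /* 32-bit encoding T4, needs Thumb-2 */
    return B_OUT_OF_RANGE;       /* load the address and use bx instead */
}
```

For a branch at 0, a target of 0xC fits the narrow form, a target 2052 bytes away needs the wide form, and 0x12345678 is out of reach for both.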

Maybe this is what you wanted to do?

.thumb
mov r0,#0x51
bx r0

00000000 <.text>:
   0:   2051        movs    r0, #81 ; 0x51
   2:   4700        bx  r0

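This branches to absolute address 0x50; bit 0 of the 0x51 in r0 is set so that bx stays in Thumb state. For that particular address, no Thumb-2 extensions are needed.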

.thumb
ldr r0,=0x12345679
bx r0
00000000 <.text>:
   0:   4800        ldr r0, [pc, #0]    ; (4 <.text+0x4>)
   2:   4700        bx  r0
   4:   12345679    eorsne  r5, r4, #126877696  ; 0x7900000

This branches to address 0x12345678, or any other possible address.

score 0 · Accepted Answer

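Thanks @dwelch, but I did not quite follow you. Apologies for my ignorance...

I tried encoding/decoding the B instruction with bitwise operations, simple and silly as that is :) The code below seems to work now. @Jester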

#include <assert.h>
#include <stddef.h>

#define MAX_CODE_LEN 4
typedef unsigned char uchar;
typedef unsigned int uint;

/* code is (first halfword << 16) | second halfword of a B.W (T4). */
static int decode_B_T4(const int code)
{
    const uint c = (uint)code;              /* avoid shifting a negative int */
    const int S = (c >> 26) & 1;            /* bit [26] */
    const int J1 = (c >> 13) & 1;           /* bit [13] */
    const int J2 = (c >> 11) & 1;           /* bit [11] */
    const int imm10 = (c >> 16) & 0x3FF;    /* extract imm10 */
    const int imm11 = c & 0x7FF;            /* extract imm11 */
    const int I1 = (~(J1 ^ S)) & 1;
    const int I2 = (~(J2 ^ S)) & 1;
    int offset = 0;
    offset |= I1 << 23;
    offset |= I2 << 22;
    offset |= imm10 << 12;
    offset |= imm11 << 1;
    if (S) {
        offset -= 1 << 24;                  /* sign extend from bit [24] */
    }
    return offset;
}

static int encode_B_T4(const int src_addr, const int dst_addr, uchar* buf)
{
    assert(buf != NULL);
    uint code;
    const int code_len = 4;                           /* 4 bytes */
    const int offset = (dst_addr & (~1)) - (src_addr & (~1)) - 4;
    /* T4 reaches even offsets from -16777216 to 16777214 bytes */
    assert(offset >= -16777216 && offset <= 16777214);
    const uint uoff = (uint)offset;
    const uint S = offset < 0;                        /* sign */
    const uint I1 = (uoff >> 23) & 1;                 /* bit [23] */
    const uint I2 = (uoff >> 22) & 1;                 /* bit [22] */
    const uint imm10 = (uoff >> 12) & 0x3FF;          /* extract imm10 */
    const uint imm11 = (uoff >> 1) & 0x7FF;           /* extract imm11 */
    const uint J1 = ((~I1 & 1) ^ S) & 1;
    const uint J2 = ((~I2 & 1) ^ S) & 1;
    code = 0x1Eu << 27;                               /* set the 5 MSB: 11110 */
    code |= S << 26;
    code |= imm10 << 16;
    code |= 1u << 15;
    code |= J1 << 13;
    code |= 1u << 12;
    code |= J2 << 11;
    code |= imm11;
    assert(code_len <= MAX_CODE_LEN);
    /* A 32-bit Thumb instruction is stored as two little-endian halfwords,
       first halfword at the lower address. A plain memcpy of code on a
       little-endian host would put the halfwords in the wrong order. */
    buf[0] = (uchar)((code >> 16) & 0xFF);
    buf[1] = (uchar)((code >> 24) & 0xFF);
    buf[2] = (uchar)(code & 0xFF);
    buf[3] = (uchar)((code >> 8) & 0xFF);
    return code_len;
}
