parsing - 从程序集文件创建控制流图

Question

我想使用 C 语言从汇编文件创建控制流图（CFG）。我一直在考虑它，这些是我的想法： 1. 创建块 - 逐行处理汇编文件 - 找到重要的指令，如函数名称、块名称、跳转指令、调用指令或离开/返回，也许还有其他一些 - 找到它们也许用正则表达式？但我还没有在 Windows 上找到 C 的正则表达式实现。- 匹配上述指令后，将匹配前的指令保存到某个结构，这是我的块 2. 创建 CFG - somhow 从块创建 CFG 但我不知道

谁能给我一些建议怎么做？此外，如果有更好的语言可以做到这一点，如果你告诉我，我会很高兴。
感谢您的时间和帮助。

score 3 · Accepted Answer

What OP needs is a disciplined approach to accomplishing this task.

He needs a good parser for the assembler source, so he knows he has an accurate representation of it. Beyond the pure parse part, he'll have to emulate the assembler pretty completely, including all the complications, such as macros, conditional blocks, multiple location counters, absolute/relative/external symbols, etc. (Build a good parser by relying solely on regexes isn't going to work.)

He'll then need to compute first estimate of the control flow graph by inspecting the sequences of machine instructions and branches. This may be harder to do than it looks; in big, complicated assembly codes people abuse entry points into procedures so that it is sometimes hard to tell what's an instruction, and what is data.

(Here's a trick I use in a big x86 application. I want to add sanity checking to my code in many places. The sanity tests look like this:

  <test for some sane condition>
  jf  location+3       ; this branchs to a breakpoint in the middle of the next instruction
  cmp  al, 0xCC        ; the immediate value is a BREAKPOINT opcode

They're compact, and a breakpoint occurs when some bad happens. But when analyzing this program for control flow, the "jmp false" sometimes branches into what appears to be the middle of an instruction. How will OP model that?)

The next complication are pointers to code. Assembler code often generates many pointers to other other instructions, and then hides those pointers in various places (the call instruction pushes them onto the data stack for the x86), retrieves them, and then does "jmp indirect". If you want to know where that jmp might go, you need to track the possible values a memory location might contain, which means you need to do a data flow analysis (how do values get there, and from where) and combine with call graph construction (can't get to that function? OK, then where it goes won't affect this code) to compute an answer that is reasonable.

Doing all this by ad hoc methods will end up producing inaccurate (useless) answers. OP needs to find a framework in which to build his parser, and implement good quality points-to analysis algorithms if he hopes to get a good result.

C isn't specifically designed to support this task. It can do it with enough additional sweat, but that's true for any programming language.

(Check my bio for such a framework. OP can use any framework that works for him).

score 2 · Accepted Answer

一个更简单的方法是组装汇编文件，然后反汇编它。大多数解析问题都被消除了。反汇编将具有用于标签、操作码和操作数的固定列，因此只需要很少的解析。

使用拆卸，执行两遍。第一步是创建一个数据结构来表示每条指令并收集所有跳转目标。

第二遍是创建表示基本块的结构（具有单个入口点和出口的代码块）。将每个基本块链接到其后继块。一个基本块可以有零个、一个或两个后继（或者在跳转表的情况下有 N 个后继）。以 RET 结尾的基本块有零个后继。以无条件跳转结尾的基本块有一个后继。一个带有条件跳转的基本块有两个后继——要么落空，要么跳转目标。没有前驱的基本块要么是子程序入口点，要么是死（或无效）代码。

跳转目标是基本块的开始，无条件跳转之后的指令也是如此（应该是跳转目标或子程序入口点）。

如果不能确定地知道间接跳转的目标（通过真实硬件或仿真器）运行程序是不可行的。我建议支持几个简单的情况：1）表在程序内的跳转表 2）通过用于链接到另一个可执行文件的全局内存位置的跳转（反汇编程序列出目标是什么）。在第一种情况下，基本块可以有任意多个后继块。在第二种情况下，基本块只有一个后继。

请注意，我故意将 CALL 作为 CFG 的一部分省略。当我实现了一个 CFG 绘图仪时，我就是这么做的。我的绘图员一次只显示一个函数。如果您双击一个 CALL，则显示子程序的 CFG。

但是，如果您想将整个子程序树包含在单个 CFG 中，则 CALL 将是基本块的结尾，而 CALL 之后的指令将是基本块的开始。请注意，除了最简单的程序之外，很难查看程序的整个 CFG。

我忽略了 INT 和 IRET，因为我假设您正在处理用户模式应用程序。如果不是，则将 INT 视为调用，将 IRET 视为 RET。可以从启用中断的任何地方调用硬件中断服务例程 (ISR)，因此（通常）不会直接调用 ISR - 它只是坐在一边。更一般地说，如果您正在处理内核模式软件，您将有各种其他考虑因素。

score 1 · Accepted Answer

我遇到了和你一样的问题，没有找到任何现成的解决方案，所以我用 Python 自己写了一个简单的解决方案：https ://github.com/Kazhuu/asm2cfg 。

请注意，我只使用来自 GDB 的函数反汇编转储对其进行了测试。我认为这可以扩展到与 objdump 一起使用。

parsing - 从程序集文件创建控制流图

3 回答 3

Related

Reference