compiler-construction - Direct-coded vs. table-driven lexer?

Question

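I'm new to compiler construction, and I'd like to know what the differences are between a direct-coded lexer and a table-driven lexer.

If possible, please use simple source code examples.

Thanks.

Edit:

In the book Engineering a Compiler, the authors divide lexers into three (3) types: table-driven, direct-coded, and hand-coded.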

score 28 · Accepted Answer

~~I'm assuming that by "direct-coded" you mean a hand-written lexer rather than one produced as the output of a lexer generator. In that case...~~ (See the edit below.)

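...a table-driven lexer is a (usually automatically generated) program that uses some kind of lookup table to determine which action to take. Consider the finite automaton corresponding to the regular expression ab*a (intentionally not minimized):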

(Figure: DFA for ab*a)

If we limit ourselves to the characters 'a' and 'b', we can encode it in a lookup table like so:

#define REJECT -1

/* This table encodes the transitions for a given state and character.
 * C requires every array dimension except the first to be given explicitly.
 */
const int transitions[][2] = {
    /* In state 0, if we see an a then go to state 1 (the 1).
     * Otherwise, reject input.
     */
    { /*a*/  1,       /*b*/  REJECT },
    { /*a*/  2,       /*b*/  3      },
    { /*a*/  REJECT,  /*b*/  REJECT }, /* Could put anything here: state 2 accepts, so this row is never read. */
    { /*a*/  2,       /*b*/  3      }
};

/* This table determines, for each state, whether it is an accepting state. */
const int accept[] = { 0, 0, 1, 0 };

Now we just need a driver that actually uses the table:

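#include <stdio.h>

int scan(void) {
    int state = 0;

    while (!accept[state]) {
        int ch = getchar(); /* int, so EOF stays distinct from real characters */
        /* Adjust so that a => 0, b => 1. Anything else, including EOF,
         * has no column in the table and is rejected outright.
         */
        int next = (ch == 'a' || ch == 'b') ? transitions[state][ch - 'a'] : REJECT;
        if (next == REJECT) {
            fprintf(stderr, "invalid token!\n");
            return 0; /* Fail. */
        }
        state = next;
    }
    return 1; /* Success! */
}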

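Of course, in a real lexer every token would have a corresponding accepting state, and you would have to modify the accept table somehow so that it contains the token identifier. I want to stress two things, though: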

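1. A table-driven lexer doesn't necessarily operate on the level of DFA states.
2. I don't recommend writing table-driven lexers by hand, as it is a tedious and error-prone process.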
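
To make the remark about token identifiers concrete, here is a minimal sketch in which the accept table holds a token kind instead of a 0/1 flag. The token set, the character classes, and the names `scan_token` and `classify` are assumptions made up for this illustration, and it reads from a string rather than stdin so that it can be tried in isolation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds and character classes for this sketch. */
enum token { TOK_ERROR, TOK_IDENT, TOK_NUMBER };
enum char_class { CC_LETTER, CC_DIGIT, CC_OTHER, CC_COUNT };

#define REJECT -1

/* Rows are states, columns are character classes.
 * State 0 = start, 1 = inside an identifier, 2 = inside a number.
 */
static const int transitions[][CC_COUNT] = {
    /* letter  digit   other  */
    {  1,      2,      REJECT },  /* state 0 */
    {  1,      1,      REJECT },  /* state 1 */
    {  REJECT, 2,      REJECT }   /* state 2 */
};

/* Instead of a 0/1 flag, each state records which token it accepts. */
static const enum token accept[] = { TOK_ERROR, TOK_IDENT, TOK_NUMBER };

static enum char_class classify(unsigned char ch) {
    if (isalpha(ch)) return CC_LETTER;
    if (isdigit(ch)) return CC_DIGIT;
    return CC_OTHER;
}

/* Scan the longest token at the start of input and store its length in *len. */
enum token scan_token(const char *input, size_t *len) {
    int state = 0;
    size_t i = 0;

    while (input[i] != '\0') {
        int next = transitions[state][classify((unsigned char)input[i])];
        if (next == REJECT) break;  /* no transition: the token ends here */
        state = next;
        i++;
    }
    *len = i;
    return accept[state];
}
```

With this table, `scan_token("abc1 x", &len)` should yield `TOK_IDENT` with `len` set to 4. Every non-start state here happens to be accepting, so stopping at the first missing transition gives the longest match; a lexer for a larger token set would typically also remember the last accepting state it passed through and back up to it.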

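A hand-written (for lack of a better name) lexer usually doesn't use a lookup table. Suppose we want a lexer for a simple Lisp-like language that has parentheses, identifiers, and decimal integers: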

#include <ctype.h>
#include <stdio.h>

enum token {
    ERROR,
    LPAREN,
    RPAREN,
    IDENT,
    NUMBER
};

/* Forward declarations for the helpers defined below. */
int first_nonblank(void);
enum token ident(void);
enum token number(void);

enum token scan(void) {
    /* Consume all leading whitespace. */
    int ch = first_nonblank();
    if (ch == '(') return LPAREN;
    else if (ch == ')') return RPAREN;
    else if (isalpha(ch)) return ident();
    else if (isdigit(ch)) return number();
    else {
        printf("invalid token!\n");
        return ERROR;
    }
}

/* ch is an int throughout: getchar() returns an int so that EOF stays
 * distinct from every real character, and the <ctype.h> functions expect
 * exactly that range of values.
 */
int first_nonblank(void) {
    int ch;
    do {
        ch = getchar();
    } while (isspace(ch));
    return ch;
}

enum token ident(void) {
    int ch;
    do {
        ch = getchar();
    } while (isalpha(ch));
    ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
    return IDENT;
}

enum token number(void) {
    int ch;
    do {
        ch = getchar();
    } while (isdigit(ch));
    ungetc(ch, stdin); /* Put back the first non-digit. */
    return NUMBER;
}

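As with the table-driven lexer example, this one is not complete. For one thing, it needs some kind of buffering to store the characters that are part of a token such as IDENT and NUMBER. For another, it doesn't handle EOF particularly gracefully. But hopefully you get the gist of it.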
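
As a rough sketch of how those two gaps might be closed, the variant below keeps the text of the current token in a buffer and reports end of input as its own token. It reads from a string instead of stdin to stay self-contained, and the `struct lexer` layout, the `TOK_` names, `take_while`, `next_token`, and the `MAX_LEXEME` limit are all assumptions made up for this illustration.

```c
#include <ctype.h>
#include <stddef.h>

enum token { TOK_EOF, TOK_ERROR, TOK_LPAREN, TOK_RPAREN, TOK_IDENT, TOK_NUMBER };

#define MAX_LEXEME 63  /* arbitrary limit chosen for this sketch */

struct lexer {
    const char *pos;              /* next unread character */
    char lexeme[MAX_LEXEME + 1];  /* text of the most recent IDENT or NUMBER */
};

/* Copy characters into the lexeme buffer for as long as pred accepts them. */
static void take_while(struct lexer *lx, int (*pred)(int)) {
    size_t n = 0;
    while (*lx->pos != '\0' && pred((unsigned char)*lx->pos)) {
        if (n < MAX_LEXEME) lx->lexeme[n++] = *lx->pos;  /* overlong tokens are truncated */
        lx->pos++;
    }
    lx->lexeme[n] = '\0';
}

enum token next_token(struct lexer *lx) {
    while (isspace((unsigned char)*lx->pos)) lx->pos++;  /* skip leading whitespace */

    unsigned char ch = (unsigned char)*lx->pos;
    lx->lexeme[0] = '\0';

    if (ch == '\0') return TOK_EOF;  /* end of input is a token, not an error */
    if (isalpha(ch)) { take_while(lx, isalpha); return TOK_IDENT; }
    if (isdigit(ch)) { take_while(lx, isdigit); return TOK_NUMBER; }

    lx->pos++;  /* single-character tokens and unknown characters */
    if (ch == '(') return TOK_LPAREN;
    if (ch == ')') return TOK_RPAREN;
    return TOK_ERROR;
}
```

A caller would initialise it with something like `struct lexer lx = { .pos = "(add 12 x)" };` and call `next_token(&lx)` until it returns `TOK_EOF`, reading `lx.lexeme` after each `TOK_IDENT` or `TOK_NUMBER`. Truncating overlong tokens is only one possible policy; growing the buffer or reporting an error would be equally reasonable.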

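Edit: Going by the definition in Engineering a Compiler, a direct-coded lexer is basically a hybrid of the two. Rather than using a table, we use code labels to represent states. Let's see how that would look using the same DFA as before.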

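/* error() is assumed to be a helper that reports the invalid token. */
int scan(void) {
    int ch; /* int rather than char, so EOF stays distinct from real characters */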

state0:
    ch = getchar();
    if (ch == 'a') goto state1;
    else { error(); return 0; }

state1:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3;
    else { error(); return 0; }

state2:
    return 1; /* Accept! */

state3:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3; /* Loop. */
    else { error(); return 0; }
}

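In my personal experience, the "best" approach to writing lexers is the hand-written approach I outlined above. It works on every platform and in every language, you don't need to learn ancient tools like lex, and, perhaps most importantly, you have the flexibility to extend the lexer with features that are difficult or impossible to implement with a tool. For example, perhaps you want your lexer to understand Python-esque block indentation, or maybe you need to dynamically extend certain token classes.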
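
To give a feel for the indentation case, here is a minimal sketch of the bookkeeping a hand-written lexer might do at the start of each line: it keeps a stack of indentation widths and reports how many INDENT or DEDENT tokens the line implies. The names `indent_state`, `indent_change`, `MAX_DEPTH`, and `INDENT_ERROR` are assumptions made up for this illustration, and it counts spaces only, leaving tabs and blank lines to the caller.

```c
#include <stddef.h>

#define MAX_DEPTH 32     /* arbitrary nesting limit for this sketch */
#define INDENT_ERROR 99  /* sentinel; never a valid INDENT/DEDENT count */

struct indent_state {
    int widths[MAX_DEPTH];  /* stack of active indentation widths; widths[0] must be 0 */
    int depth;              /* index of the top of the stack */
};

/* Returns +1 if the line opens a new block (one INDENT), -n if it closes
 * n blocks (n DEDENTs), 0 if the level is unchanged, or INDENT_ERROR if the
 * width matches no enclosing block or the nesting limit is exceeded.
 */
int indent_change(struct indent_state *st, const char *line) {
    int width = 0;
    while (line[width] == ' ') width++;  /* count leading spaces only */

    if (width > st->widths[st->depth]) {
        if (st->depth + 1 >= MAX_DEPTH) return INDENT_ERROR;
        st->widths[++st->depth] = width;  /* push the new level */
        return 1;
    }

    int dedents = 0;
    while (width < st->widths[st->depth]) {  /* pop every level deeper than this line */
        st->depth--;
        dedents++;
    }
    if (width != st->widths[st->depth]) return INDENT_ERROR;
    return -dedents;
}
```

Starting from a zero-initialised state (`struct indent_state st = {0};`), the lexer would call `indent_change` once per non-blank line and emit that many INDENT or DEDENT tokens before scanning the rest of the line. This kind of state carried between tokens is awkward to express as a pure regular-expression table, which is the flexibility the paragraph above is referring to.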

score -1 · Accepted Answer

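Take a look at my lexer generator. It is very simple and easy to understand, and it generates a direct-coded DFA automaton as nested switch statements. I have used this approach in my own projects, first writing the code by hand and later generating it with this tool. The approach is based on what I learned about the subject by reading several books and studying the implementations of more advanced parser generators. The project is on github: rmjocz/langgen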
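
For readers who have not seen the nested-switch style, here is a minimal sketch of what such an encoding might look like for the ab*a DFA from the accepted answer. It is written by hand for illustration and is not output from the tool mentioned above; the function name `match_aba` and the choice to read from a string are assumptions.

```c
#include <stddef.h>

/* Returns 1 if input starts with a match for ab*a, 0 otherwise.
 * The outer switch selects the current state and the inner switch
 * selects the transition for the next character.
 */
int match_aba(const char *input) {
    int state = 0;
    size_t i = 0;

    for (;;) {
        switch (state) {
        case 0:
            switch (input[i++]) {
            case 'a': state = 1; break;
            default:  return 0;  /* no transition: reject */
            }
            break;
        case 1:
            switch (input[i++]) {
            case 'a': state = 2; break;
            case 'b': state = 3; break;
            default:  return 0;
            }
            break;
        case 2:
            return 1;  /* accepting state: nothing more is read */
        case 3:
            switch (input[i++]) {
            case 'a': state = 2; break;
            case 'b': state = 3; break;  /* loop */
            default:  return 0;
            }
            break;
        }
    }
}
```

The terminating '\0' of the string falls into a `default` branch in states 0, 1, and 3, so the function never reads past the end of the input. Compared with the goto version, the state lives in a variable rather than in the program counter, which is easier for a generator to emit mechanically.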
