parsing - 递归下降解析器实现

Question

我正在寻找编写一些递归下降解析器的伪代码。现在，我对这种类型的编码没有经验。我在网上阅读了一些示例，但它们仅适用于使用数学表达式的语法。这是我基于解析器的语法。

S -> if E then S | if E then S else S | begin S L | print E

L -> end | ; S L

E -> i

我必须编写方法S()，L()并E()返回一些错误消息，但是我在网上找到的教程并没有太大帮助。谁能指出我正确的方向并给我一些例子？

我想用 C# 或 Java 语法编写它，因为它更容易联系起来。

更新

public void S() {
    if (currentToken == "if") {
        getNextToken();
        E();

        if (currentToken == "then") {
            getNextToken();
            S();

            if (currentToken == "else") {
                getNextToken();
                S();
                Return;
            }
        } else {
            throw new IllegalTokenException("Procedure S() expected a 'then' token " + "but received: " + currentToken);
        } else if (currentToken == "begin") {
            getNextToken();
            S();
            L();
            return;
        } else if (currentToken == "print") {
            getNextToken();
            E();
            return;
        } else {
            throw new IllegalTokenException("Procedure S() expected an 'if' or 'then' or else or begin or print  token " + "but received: " + currentToken);
        }
    }
}


public void L() {
    if (currentToken == "end") {
        getNextToken();
        return;
    } else if (currentToken == ";") {
        getNextToken();
        S();
        L();
        return;
    } else {
        throw new IllegalTokenException("Procedure L() expected an 'end' or ';' token " + "but received: " + currentToken);
    }
}


public void E() {
    if (currentToken == "i") {
        getNextToken();
        return;
    } else {
        throw new IllegalTokenException("Procedure E() expected an 'i' token " + "but received: " + currentToken);
    }
}

score 18 · Accepted Answer

基本上在递归下降中，将语法中的每个非终结符解析为一个过程，然后在每个过程中检查您正在查看的当前标记是否与您期望在非终结符右侧看到的相符与程序对应的终端符号，如果是则继续应用产生式，如果不是，则您遇到错误并且必须采取一些措施。

因此，在您上面提到的情况下，您将有程序：S()、L()和E()，我将举例说明如何实施，然后L()您可以自己尝试和做。S()E()

同样重要的是要注意，您将需要一些其他程序来为您标记输入，然后您可以只要求该程序从您的输入中获取下一个标记。

/**
 * This procedure corresponds to the L non-terminal
 * L -> 'end'
 * L -> ';' S L
 */ 
public void L()
{
   if(currentToken == 'end')
   {
      //we found an 'end' token, get the next token in the input stream
      //Notice, there are no other non-terminals or terminals after 
      //'end' in this production so all we do now is return
      //Note: that here we return to the procedure that called L()
      getNextToken();
      return; 
   } 
   else if(currentToken == ';')
   {
      //we found an ';', get the next token in the input stream
      getNextToken();
      //Notice that there are two more non-terminal symbols after ';'
      //in this production, S and L; all we have to do is call those
      //procedures in the order they appear in the production, then return
      S();
      L();
      return;
   }
   else
   {
      //The currentToken is not either an 'end' token or a ';' token 
      //This is an error, raise some exception
      throw new IllegalTokenException(
          "Procedure L() expected an 'end' or ';' token "+
          "but received: " + currentToken);
   }
}

现在您尝试S()and E()，然后回发。

同样正如 Kristopher 指出的那样，你的语法有一个叫做 dangling else 的东西，这意味着你有一个从同一件事开始的作品：

S -> if E then S 
S -> if E then S else S

因此，如果您的解析器看到“if”标记，这就引出了一个问题，它应该选择哪个产品来处理输入？答案是它不知道选择哪一个，因为与人类不同，编译器无法提前查看输入流以搜索“else”标记。这是一个简单的问题，可以通过应用称为左因子的规则来解决，这与代数问题的因子非常相似。

你所要做的就是创建一个新的非终结符号 S'（S-prime），它的右手边将保存不常见的产品，所以你的S产品 no 变成：

S  -> if E then S S'
S' -> else S 
S' -> e   
(e is used here to denote the empty string, which basically means there is no   
 input seen)

score 8 · Accepted Answer

这不是最容易开始的语法，因为您对第一个生产规则有无限量的前瞻：

S -> if E then S | if E then S else S |  begin S L | print E

考虑

if 5 then begin begin begin begin ...

我们什么时候确定这个愚蠢的其他？

另外，考虑

if 5 then if 4 then if 3 then if 2 then print 2 else ...

现在，这else应该绑定到if 5 then片段吗？如果不是，那实际上很酷，但要明确。

您可以等效地重写语法（可能取决于 else 规则）：

S -> if E then S (else S)? | begin S L | print E
L -> end | ; S L
E -> i

这可能是也可能不是你想要的。但是伪代码从这里跳出来了。

define S() {
  if (peek()=="if") {
    consume("if")
    E()
    consume("then")
    S()
    if (peek()=="else") {
      consume("else")
      S()
    }
  } else if (peek()=="begin") {
    consume("begin")
    S()
    L()
  } else if (peek()=="print") {
    consume("print")
    E()
  } else {
    throw error()
  }
}

define L() {
  if (peek()=="end") {
    consume("end")
  } else if (peek==";")
    consume(";")
    S()
    L()
  } else {
    throw error()
  }
}

define E() {
  consume_token_i()
}

对于每个备选方案，我创建了一个查看唯一前缀的 if 语句。任何匹配尝试的最终 else 始终是错误的。我使用关键字并在遇到它们时调用与生产规则相对应的函数。

从伪代码转换为真实代码并不太复杂，但也不是微不足道的。那些偷看和消费可能实际上并没有对字符串进行操作。对令牌进行操作要容易得多。简单地走一个句子并使用它与解析它并不完全相同。当你使用这些片段时，你会想做一些事情，可能会构建一个解析树（这意味着这些函数可能会返回一些东西）。并且抛出一个错误在高层次上可能是正确的，但您希望将一些有意义的信息放入错误中。此外，如果您确实需要前瞻，事情会变得更加复杂。

在查看此类问题时，我会推荐 Terence Parr（编写 antlr，递归下降解析器生成器的人）的 Language Implementation Patterns。Dragon Book（Aho 等人，在评论中推荐）也很好（它可能仍然是编译器课程中的主要教科书）。

score 4 · Accepted Answer

上学期我教（真的只是帮助）了 PL 课程的解析部分。我真的建议您查看我们页面中的解析幻灯片：here。基本上，对于递归下降解析，你问自己以下问题：

我已经解析了一些非终结符，现在我可以选择接下来要解析的内容。我接下来看到的将决定我所处的非终端。

顺便说一句，你的语法表现出一种非常常见的歧义，称为“dangling else”，这种歧义自 Algol 时代以来就一直存在。在 shift reduce 解析器（通常由解析器生成器生成）中，这会产生 shift / reduce 冲突，您通常会选择任意 shift 而不是 reduce，从而为您提供常见的“maximal much”原则。（因此，如果您看到“if (b) then if (b2) S1 else S2”，您会将其读作“if (b) then { if (b2) { s1; } else { s2; } }”）

让我们把它从你的语法中去掉，并使用一个稍微简单的语法：

T -> A + T
 |   A - T
 |   A
A -> NUM * A
   | NUM / A
   | NUM

我们还将假设 NUM 是您从词法分析器获得的东西（即，它只是一个标记）。这个语法是 LL(1)，也就是说，你可以用一个使用朴素算法实现的递归下降解析器来解析它。该算法的工作原理如下：

parse_T() {
  a = parse_A();
  if (next_token() == '+') {
    next_token();  // advance to the next token in the stream
    t = parse_T();
    return T_node_plus_constructor(a, t);
  }
  else if (next_token() == '-') {
    next_token();  // advance to the next token in the stream
    t = parse_T();
    return T_node_minus_constructor(a, t);
  } else {
    return T_node_a_constructor(a);
  }
}
parse_A() {
  num = token(); // gets the current token from the token stream
  next_token();  // advance to the next token in the stream
  assert(token_type(a) == NUM);
  if (next_token() == '*') {
    a = parse_A();
    return A_node_times_constructor(num, a);
  }
  else if (next_token() == '/') {
    a = parse_A();
    return A_node_div_constructor(num, a);
  } else {
    return A_node_num_constructor(num);
  }
}

你看到这里的模式了吗：

对于语法中的每个非终结符：解析第一件事，然后看看你必须看什么来决定接下来应该解析什么。继续这个直到你完成！

另外，请注意，这种类型的解析通常不会产生您想要的算术。递归下降解析器（除非你使用尾递归的小技巧？）不会产生最左边的推导。特别是，您不能编写像“a -> a - a”这样的左递归规则，其中最左边的关联性确实是必要的！这就是人们通常使用更高级的解析器生成器工具的原因。但是递归下降技巧仍然值得了解和玩弄。

parsing - 递归下降解析器实现

更新

3 回答 3

Related

Reference