
The following extremely simple example grammar does not (entirely) behave the way I expected.

Declaration :   'VAR';
Letter: ('A'..'Z');

message :   Declaration Letter+;

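What I expected is that any sequence of letters is lexed as individual letters, and that the sequence "VAR" is lexed as a single token.

When I look at the ANTLRWorks interpreter, I see the following results: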

  • VARA parses as message -> "VAR", "A" (as expected)
  • VARVA does not parse (MismatchedTokenException(-1 != 5)). The lexer hits the second VA and tries to tokenize a Declaration. Expected: message -> "VAR", "V", "A"
  • VARVPP parses as message -> "VAR", "V", "P", "P" (as expected)
  • VARVALL parses as message -> "VAR", "VALL".

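I need some help understanding this behavior, as well as suggestions on how to solve it.

Specifically:

  • Why does the lexer try to tokenize every string that starts with VA as a Declaration if it is followed by a letter?
  • Why doesn't the lexer try to do this for all strings that start with a V?
  • Why doesn't the lexer try to do this if there are additional characters?
  • How should I change this grammar to parse the way I expected?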

2 Answers


Let's go through all 4 of your examples:

1 "VARA"

All okay.

2 "VARVA"

"VAR" is (obviously) tokenized as a Declaration, but then the lexer "sees" "VA" and expects an "R", which is not there. The following errors are reported:

line 1:5 mismatched character '<EOF>' expecting 'R'
line 1:5 required (...)+ loop did not match anything at input '<EOF>'

and the "VA" is discarded, so only a single token is created, as ANTLRWorks' debugger shows (ignore the exceptions in the parse, they're not actually there :)).

The thing you must realize is that the lexer will never give up on something it has already matched. So if the lexer sees "VA" and cannot match an "R" after it, it will then look at the other lexer rules that can match "VA". But Letter does not match that (it only matches single letters!). If you changed Letter to match more than a single character, ANTLR would be able to fall back on that rule. But not when it matches a single letter: the lexer will not give up the "A" from "VA" to let the Letter rule match the "V" alone. There is no way around it: this is how ANTLR's lexer works.

This is usually not an issue because there is often some sort of IDENTIFIER rule that the lexer can fall back on when a keyword cannot be matched.

3 "VARVPP"

All okay: "VAR" becomes a VAR and then the lexer tries to match an "A" after the "V" but this does not happen, so the lexer falls back on the Letter rule for the single "V". After that "PP" are both tokenized as Letters.

4 "VARVALL"

"VAR" again becomes a VAR. Then the "L" in "VAL" causes the lexer to produce the following error message:

line 1:5 mismatched character 'L' expecting 'R'

and then the last "L" becomes a Letter.
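
To make the four cases easier to replay, here is a minimal Python sketch of the behaviour described above. It is a toy model, not ANTLR's actual algorithm: the function name `lex_committing` and the recovery step (drop whatever was consumed plus the offending character) are assumptions, chosen only so that the output reproduces the tokens and lexer messages shown in this answer. The input is assumed to contain nothing but 'A'..'Z'.

```python
def lex_committing(text):
    """Toy model: seeing 'VA' commits the lexer to Declaration, with no way back."""
    tokens, errors, i = [], [], 0
    while i < len(text):
        if text.startswith("VA", i):
            if text.startswith("VAR", i):
                tokens.append(("Declaration", "VAR"))
                i += 3
            else:
                # No fallback to Letter: report the mismatch, then drop
                # 'VA' together with the offending character (assumed recovery).
                bad = text[i + 2] if i + 2 < len(text) else "<EOF>"
                errors.append(f"mismatched character '{bad}' expecting 'R'")
                i += 3
        else:
            # Any other letter, including a 'V' not followed by 'A', is a Letter.
            tokens.append(("Letter", text[i]))
            i += 1
    return tokens, errors


for sample in ("VARA", "VARVA", "VARVPP", "VARVALL"):
    print(sample, *lex_committing(sample))
```

In this model "VARVA" ends up with only the Declaration token, which is why the parser then complains that the Letter+ loop matched nothing.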


I guess (or hope) the first 3 questions are now answered, which leaves your final question:

How should I change this grammar to parse the way I expected?

By forcing the lexer to first look ahead in the character stream (with a syntactic predicate) to check whether there really is "VAR" ahead. If there is not, just match a single "V" and change the type of the matched token to Letter, like this:

Declaration
 : ('VAR')=> 'VAR'
 |           'V'   {$type=Letter;}
 ;
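
A rough Python sketch of what the predicate changes. Again this is only a toy model under the assumption that the input consists of 'A'..'Z'; the name `lex_with_predicate` is made up for illustration and nothing here is ANTLR API. The point is that the full "VAR" is checked before committing, so a lone "V" can still come out as a Letter.

```python
def lex_with_predicate(text):
    """Toy model of the fixed rule: only commit to Declaration if 'VAR' is really ahead."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith("VAR", i):
            # Corresponds to the alternative ('VAR')=> 'VAR'.
            tokens.append(("Declaration", "VAR"))
            i += 3
        else:
            # Covers both the 'V' {$type=Letter;} alternative and the Letter rule.
            tokens.append(("Letter", text[i]))
            i += 1
    return tokens


for sample in ("VARVA", "VARVALL"):
    print(sample, lex_with_predicate(sample))
```

With this, "VARVA" comes out as "VAR", "V", "A" and "VARVALL" as "VAR", "V", "A", "L", "L", which is what the question expected.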

As mentioned in the answer posted before mine, see this related Q&A: ANTLR lexer can't lookahead at all

answered 2012-12-03T12:42:15.700

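The lexer doesn't really perform lookahead, only the parser does; you can read more about this in ANTLR lexer can't lookahead at all. So the problem here is that once the lexer fails to match VAR, it tries to match what it has got so far, VA, and no token matches that, because Letter cannot match two characters, only one.

As for a solution, a simple one is to change it into a single token: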

Message :   'VAR' ('A'..'Z')+;
message :   Message;

It won't give you a separate token for each letter, though.
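
If the individual letters are still needed, one option is to split the text of the single Message token afterwards. A minimal Python sketch of that idea, using a regular expression as a stand-in for the Message rule (the helper name `split_message` is hypothetical, and this is not ANTLR API):

```python
import re

# Stand-in for: Message : 'VAR' ('A'..'Z')+ ;
MESSAGE = re.compile(r"VAR([A-Z]+)")


def split_message(text):
    """Return ('VAR', [letters]) if text is one whole Message token, else None."""
    match = MESSAGE.fullmatch(text)
    if match is None:
        return None
    # The letters after 'VAR' are split one by one after the match.
    return "VAR", list(match.group(1))


print(split_message("VARVA"))
print(split_message("VARVALL"))
```

Here the splitting happens after lexing, so input like "VARVA" is no longer a problem for the lexer.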

answered 2012-12-03T12:14:14.087