antlr - Parsing sentences with different word types

Question

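I'm looking for a grammar to analyze two types of sentences, where a sentence means words separated by white space:

ID1: a sentence whose words do not start with a digit
ID2: a sentence whose words do not start with a digit, plus purely numeric words

Basically, the structure of the grammar should look like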

ID1 separator ID2  

ID1: Word can contain number like Var1234 but not start with a number  

ID2: Same as above but 1234 is allowed  

separator: e.g. '='

@Bart
I just tried to add the two tokens '_' and '"' as a lexer rule Special, for later use in the lexer rule Word. Even though I don't use Special anywhere in the following grammar, I get the following error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?

grammar Sentence1b1;

tokens
{
  TCUnderscore  = '_' ;
  TCQuote       = '"' ;
}

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  ( Word | Int )+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
  ;

Special
  : ( TCUnderscore | TCQuote )
  ;

Space
  :  ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
  ;

fragment Digit
  :  '0'..'9'
  ;

The lexer rule Special should then be used in the lexer rule Word:

Word
  :  ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
  ;

score 1 · Accepted Answer

I'd go for something like this:

grammar Sentence;

assignment
  :  id1 '=' id2
  ;

id1
  :  Word+
  ;

id2
  :  (Word | Int)+
  ;

Int
  :  Digit+
  ;

// A word must start with a letter
Word
  :  ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
  ;

Space
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

fragment Digit
  :  '0'..'9'
  ;

which will parse the input:

Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed

as follows:

[image: parse tree of the input above]
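Since the parse tree image is not reproduced here, the following is a rough Python sketch of what the grammar above accepts. It is not ANTLR-generated code: `tokenize` and `parse_assignment` are hypothetical helpers, and the regular expressions are hand-written stand-ins for the Word, Int and Space rules.

```python
import re

# Hand-written stand-ins for the lexer rules of the grammar (assumption:
# a regex alternation is close enough to the ANTLR lexer for this input).
TOKEN_RE = re.compile(r"""
    (?P<Word>[A-Za-z][A-Za-z0-9]*)   # Word: a letter, then letters or digits
  | (?P<Int>[0-9]+)                  # Int: Digit+
  | (?P<Eq>=)                        # the literal '=' used in the assignment rule
  | (?P<Space>[ \t\r\n])             # Space: matched, then skipped
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (token type, text) pairs, dropping Space tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "Space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse_assignment(text):
    """Mimic: assignment : id1 '=' id2 ; id1 : Word+ ; id2 : (Word | Int)+"""
    tokens = tokenize(text)
    kinds = [kind for kind, _ in tokens]
    if kinds.count("Eq") != 1:
        raise ValueError("expected exactly one '='")
    split = kinds.index("Eq")
    id1, id2 = tokens[:split], tokens[split + 1:]
    if not id1 or any(kind != "Word" for kind, _ in id1):
        raise ValueError("id1 must consist of one or more Word tokens")
    if not id2:
        raise ValueError("id2 must consist of one or more Word or Int tokens")
    return [value for _, value in id1], [value for _, value in id2]
```

Calling `parse_assignment` on the sample sentence returns the words on each side of the '=' as two lists. An input such as `4a name = x` is rejected, because `4a` is split into an Int followed by a Word and id1 only allows Word tokens.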

EDIT

To keep the lexer rules nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation). Written as ordinary lexer rules, the order in which they are defined matters:

// wrong!
Special      : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote      : '"';

Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:

// wrong!
TCUnderscore : '_';
TCQuote      : '"';
Special      : (TCUnderscore | TCQuote);

the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:

The following token definitions can never be matched because prior tokens match the same input: ...
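The shadowing described above can be illustrated with a simplified model in Python. This is only a sketch under the assumption that the lexer picks the longest match and, on a tie, the rule defined first. ANTLR itself detects the problem statically when it generates the lexer, and the `lex` helper below is hypothetical.

```python
import re

def lex(rules, text):
    """Tokenize text with an ordered list of (name, regex) rules.

    Simplified model: the longest match wins, and on a tie the rule
    that was defined earlier wins.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule never replaces an equally long match.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# Special defined first: TCUnderscore and TCQuote are never produced.
special_first = [("Special", r'[_"]'), ("TCUnderscore", r"_"), ("TCQuote", r'"')]

# Literals defined first: Special is never produced.
literals_first = [("TCUnderscore", r"_"), ("TCQuote", r'"'), ("Special", r'[_"]')]
```

With `special_first`, every `_` or `"` comes out as a Special token. With `literals_first`, the same input only ever yields TCUnderscore and TCQuote tokens. In both orderings at least one rule is unreachable, which is what the error message reports.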

If you make TCUnderscore and TCQuote fragment rules, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:

// good!
Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';

Also, fragment rules can therefore never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).

// wrong!
parse : TCUnderscore;

Special               : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote      : '"';
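One way to picture fragment rules is as named sub-patterns that are pasted into real token rules and never emitted on their own. The Python sketch below models that idea only. The `FRAGMENTS` and `TOKEN_RULES` tables and the `token_types` helper are hypothetical names, and the Int rule is borrowed from the grammar earlier in this answer.

```python
import re

# Fragments: building blocks that only serve other lexer rules.
FRAGMENTS = {
    "TCUnderscore": r"_",
    "TCQuote": r'"',
    "Digit": r"[0-9]",
}

# Token rules: only these names can ever show up in the token stream,
# so only these names are usable from parser rules.
TOKEN_RULES = [
    ("Special", "(?:{TCUnderscore}|{TCQuote})".format(**FRAGMENTS)),
    ("Int", "(?:{Digit})+".format(**FRAGMENTS)),
]

COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_RULES]

def token_types(text):
    """Return the sequence of token type names produced for text."""
    types, pos = [], 0
    while pos < len(text):
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            if m:
                types.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no token rule matches at position {pos}")
    return types
```

Whatever the input, `token_types` can only return Special or Int. A parser rule that asks for a TCUnderscore token, as in the last "wrong" example, would therefore wait for a token type that the lexer never produces.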

score 0 · Accepted Answer

I'm not sure whether this fits your needs, but with Bart's help in my post "ANTLR - identifier with whitespace", I came up with this grammar:

grammar PropertyAssignment;

assignment
    : id_nodigitstart '=' id_digitstart EOF
    ;

id_nodigitstart
    :   ID_NODIGITSTART+
    ;

id_digitstart
    :   (ID_DIGITSTART|ID_NODIGITSTART)+
    ;

ID_NODIGITSTART
    :   ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
    ;           

ID_DIGITSTART
    :   ('0'..'9'|'a'..'z'|'A'..'Z')+
    ;

WS  :   (' ')+ {skip();}
    ;

"a name = my 4value" works, while "4a name = my 4value" results in an exception.
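The two test inputs can be checked against a small Python model of this grammar. It is a sketch, not ANTLR output: it assumes the lexer takes the longest match and prefers the earlier rule on a tie, and `token_kinds` and `is_assignment` are hypothetical helpers. Where the ANTLR-generated parser raises an exception, the model simply returns False.

```python
import re

# Stand-ins for the lexer rules, in the order they appear in the grammar.
RULES = [
    ("ID_NODIGITSTART", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("ID_DIGITSTART", re.compile(r"[0-9A-Za-z]+")),
    ("EQ", re.compile(r"=")),
    ("WS", re.compile(r" +")),
]

def token_kinds(text):
    """Return the token type names for text, dropping WS tokens."""
    kinds, pos = [], 0
    while pos < len(text):
        best = None
        for name, regex in RULES:
            m = regex.match(text, pos)
            # Longest match wins; an equally long later rule does not replace it.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best[0] != "WS":
            kinds.append(best[0])
        pos = best[1]
    return kinds

def is_assignment(text):
    """Mimic: assignment : id_nodigitstart '=' id_digitstart EOF"""
    kinds = token_kinds(text)
    if kinds.count("EQ") != 1:
        return False
    split = kinds.index("EQ")
    left, right = kinds[:split], kinds[split + 1:]
    left_ok = len(left) > 0 and all(k == "ID_NODIGITSTART" for k in left)
    return left_ok and len(right) > 0
```

In this model `4a` becomes a single ID_DIGITSTART token, which is not allowed on the left-hand side, so the second input is rejected while the first is accepted.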
