whitespace - How to specify 'greedy identifiers with a space' in ANTLR?

Question

Suppose we have input that looks like a sequence of simple English statements, each on a separate line, like these:

Alice checks
Bob bets 100
Charlie raises 100
Alice folds

Let's try parsing it with this grammar:

actions: action* EOF;
action: player=name (check | bet | call | raise | fold) NEWLINE;
check: 'checks';
bet: 'bets' amount;
call: 'calls' amount;
raise: 'raises' amount;
fold: 'folds';

name: /* The subject of this question */;
amount: '$'? INT;

INT: ('0'..'9')+;
NEWLINE: '\r'? '\n';

The number of different verbs is fixed, but what's interesting is that the name we are trying to match can contain spaces, and verbs can be part of it, too! So the following input is valid:

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

So the question is: how do we define name so that it is greedy enough to consume spaces and words that we usually treat as verbs, but not so greedy that the verbs can no longer be matched by the action rule?

My first attempt looked like this:

name: WORD (S WORD)*;
WORD: ('a'..'z'|'A'..'Z'|'0'..'9')+; // Yes, 1234 is a WORD, too...
S: ' '; // We have to keep spaces in names

Unfortunately, this will not match 'Guy who always bets', since bets is not a WORD but a different token, defined by the literal 'bets' in the grammar. I wanted to get around that by creating a rule like keyword[String word] and making the other rules match, say, keyword["bets"] instead of a literal, but that's where I got stuck. (I guess I could just list all my verbs as valid alternatives inside a name, but it just feels wrong.)

What's more, all the names are declared before they are used, so I can read them before I start parsing actions, and they can't be longer than MAX_NAME_LENGTH characters. Could that be of any help here?

Maybe I'm doing it wrong, anyway. ANTLR gurus, can I hear from you?

score 2 · Accepted Answer

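The easiest way is to enable global backtracking on the entire grammar. This is generally not recommended, but I guess your grammar will stay relatively small, in which case it won't matter much for the running time of your parser. If you do find it becoming slow, you can uncomment the memoize option, which will make your parser faster at the cost of some memory.

A demo:

in.txt

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

Poker.g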

grammar Poker;

options {
  backtrack=true;
  // memoize=true;
}

actions
  :  action* EOF
  ;

action
  :  name SPACES (bets | calls | raises | CHECKS | FOLDS) SPACES? (NEWLINE | EOF)
     {
       System.out.println($name.text);
     }
  ;

bets    : BETS SPACES amount;
calls   : CALLS SPACES amount;
raises  : RAISES SPACES amount;
name    : anyWord (SPACES anyWord)*;
amount  : '$'? INT;
anyWord : BETS | FOLDS | CHECKS | CALLS | RAISES | INT | WORD; 

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
SPACES  : ' '+;
NEWLINE : '\r'? '\n';

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PokerLexer lexer = new PokerLexer(new ANTLRFileStream("in.txt"));
    PokerParser parser = new PokerParser(new CommonTokenStream(lexer));
    parser.actions();
  }
}

Running the Main class produces:

bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp antlr-3.3.jar org.antlr.Tool Poker.g
bart@hades:~/Programming/ANTLR/Demos/Poker$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp .:antlr-3.3.jar Main
Guy who always bets 100
Guy who always checks
Guy who always calls
Guy who always folds
Guy who always checks and then raises bets by others
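
For intuition about what "greedy, but with backtracking" means here, the same name/action split can be reproduced in plain Java with a regular expression. This is only a rough cross-check under my own assumptions, not how ANTLR implements backtracking: the class ActionRegexCheck, the method nameOf and the pattern itself are made up for this illustration and approximate the rules in Poker.g.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative cross-check only; this class is not part of the ANTLR demo.
final class ActionRegexCheck {
    // One word: letters only or digits only, like WORD and INT in Poker.g.
    private static final String W = "(?:[A-Za-z]+|[0-9]+)";

    // Group 1 is the name, group 2 the action. The name part is greedy, and
    // the regex engine backtracks until the rest of the line is a valid action.
    private static final Pattern ACTION = Pattern.compile(
        "(" + W + "(?: +" + W + ")*) +"
        + "(checks|folds|(?:bets|calls|raises) +\\$?[0-9]+) *");

    /** Returns the player name, or null if the line is not a valid action. */
    static String nameOf(String line) {
        Matcher m = ACTION.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Guy who always bets 100 checks",
            "Guy who always checks bets 100",
            "Guy who always calls folds",
            "Guy who always folds raises 100",
            "Guy who always checks and then raises bets by others calls $100"
        };
        for (String line : lines) {
            System.out.println(nameOf(line));
        }
    }
}
```

The name group first swallows as many words as it can and then gives words back one at a time until what remains is a complete action, which is the behaviour the question asks for.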

EDIT

You could also do it the other way around: negate the tokens you don't want anyWord to match:

// other parser rules
anyWord : ~(SPACES | NEWLINE | DOLLAR); 

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
DOLLAR  : '$';
SPACES  : ' '+;
NEWLINE : '\r'? '\n';

anyWord now matches any token except SPACES, NEWLINE and DOLLAR. Note the difference between ~ inside lexer rules (negates characters) and inside parser rules (negates tokens!).

score 0 · Accepted Answer

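An easy solution: split on whitespace, reverse the input word by word, and then parse from the right instead of from the left. (This requires rewriting your grammar, of course.)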
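
A minimal sketch of the right-to-left idea, written in plain Java rather than as a rewritten grammar. The class RightToLeftActions and its methods are hypothetical names for this illustration, and the verb lists are taken from the question. Instead of physically reversing the words, it inspects them from the end of the line, where the verb position is unambiguous.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an ANTLR grammar: it reads each line from the
// right, where the position of the verb is unambiguous.
final class RightToLeftActions {
    // Verbs that are followed by an amount, and verbs that stand alone.
    static final List<String> WITH_AMOUNT = Arrays.asList("bets", "calls", "raises");
    static final List<String> STANDALONE = Arrays.asList("checks", "folds");

    /** Returns {name, verb, amount}; amount is "" for standalone verbs. */
    static String[] parse(String line) {
        String[] words = line.trim().split("\\s+");
        int n = words.length;
        String last = words[n - 1];
        // Case 1: the last word is a standalone verb; the rest is the name.
        if (n >= 2 && STANDALONE.contains(last)) {
            return new String[] { join(words, n - 1), last, "" };
        }
        // Case 2: "<name> <verb> <amount>", where amount is digits with an optional '$'.
        if (n >= 3 && last.matches("\\$?[0-9]+") && WITH_AMOUNT.contains(words[n - 2])) {
            return new String[] { join(words, n - 2), words[n - 2], last };
        }
        throw new IllegalArgumentException("Not a valid action: " + line);
    }

    // Joins the first 'count' words with single spaces (runs of spaces are collapsed).
    static String join(String[] words, int count) {
        return String.join(" ", Arrays.copyOfRange(words, 0, count));
    }

    public static void main(String[] args) {
        String[] r = parse("Guy who always checks and then raises bets by others calls $100");
        System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
    }
}
```

Because an amount can never be mistaken for checks or folds, looking at the last one or two words is enough to decide where the name ends, so no backtracking is needed.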
