
A data stream contains two packets. Each packet consists of a header followed by binary data of unknown length; the data runs until the next header is found or EOF is reached. Here is the data:

HDR12HDR345

HDR is the header marker; 12 and 345 are the binary data.

And here is my current grammar, which does not work:

grammar TEST;

parse   :   STREAM EOF;
STREAM  :   PACKET*;
PACKET  :   HEADER DATA;
HEADER  :   'HDR';
DATA    :   .*;

The first header token is recognized, but the data token is too long and it consumes the next header and data.

After three days of searching I have not found a solution that covers both the "binary data" and the "unknown length" aspects. Still, I think this must be a common parsing scenario. ANTLR is not as easy as it looks at first sight :(

Thanks for any help or suggestions.


1 Answer


Without anything placed directly after .*, ANTLR will consume as much as possible (until EOF). So the rule:

DATA : .*;

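should be changed (there must be something after the .*).

Also, every lexer rule should match at least one character. Your STREAM rule, however, can match an empty string, which causes your lexer to create an infinite number of empty-string tokens.

Lastly, ANTLR is meant to parse textual input, not binary data. For more information, see this Q&A on the ANTLR mailing list, or search the list.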
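
If the payload really is binary, one alternative is to split the stream by hand before (or instead of) handing anything to ANTLR. The following is only a minimal sketch in plain Java, not something ANTLR provides: the class `PacketSplitter` and its methods are hypothetical names, and it assumes the three ASCII bytes HDR never occur inside a payload.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of ANTLR): splits a byte stream into packet payloads.
class PacketSplitter {
    // Assumption: the header marker is the ASCII bytes 'H', 'D', 'R'.
    private static final byte[] MARKER = "HDR".getBytes(StandardCharsets.US_ASCII);

    // True if the complete marker starts at index pos.
    static boolean markerAt(byte[] buf, int pos) {
        if (pos + MARKER.length > buf.length) {
            return false;
        }
        for (int i = 0; i < MARKER.length; i++) {
            if (buf[pos + i] != MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the payload of every packet. Bytes before the first marker are ignored.
    static List<byte[]> split(byte[] buf) {
        List<byte[]> payloads = new ArrayList<>();
        int pos = 0;
        // Skip anything that precedes the first header.
        while (pos < buf.length && !markerAt(buf, pos)) {
            pos++;
        }
        // Invariant: pos is either at a marker or at the end of the buffer.
        while (pos < buf.length) {
            int start = pos + MARKER.length; // payload begins right after the marker
            int end = start;
            // The payload runs until the next marker or the end of the stream.
            while (end < buf.length && !markerAt(buf, end)) {
                end++;
            }
            payloads.add(Arrays.copyOfRange(buf, start, end));
            pos = end;
        }
        return payloads;
    }

    public static void main(String[] args) {
        byte[] stream = "HDR12HDR345".getBytes(StandardCharsets.US_ASCII);
        for (byte[] payload : split(stream)) {
            System.out.println(new String(payload, StandardCharsets.US_ASCII));
        }
    }
}
```

For the stream from the question this yields two payloads, 12 and 345. A payload that happens to contain the marker bytes would be split in the wrong place; that ambiguity comes from the format itself and no parser can resolve it without a length field or escaping.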

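EDIT

Besides placing something after the .*, you can also perform some "manual" look-ahead in the lexer. Here is a small demo of how to tell ANTLR to keep consuming characters until the lexer "sees" a particular thing ahead ("HDR" in your case):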

grammar T;

@parser::members {
  public static void main(String[] args) throws Exception {
    String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
    TLexer lexer = new TLexer(new ANTLRStringStream(input));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

@lexer::members {
  private boolean hdrAhead() {
    return input.LA(1) == 'H' && 
           input.LA(2) == 'D' && 
           input.LA(3) == 'R';
  }
}

parse  : stream EOF;
stream : packet*; // parser rules _can_ match nothing
packet : HEADER DATA? {System.out.println("parsed :: " + $text.replaceAll("\\s+", ""));};
HEADER : 'HDR' '0'..'9'+;
DATA   : ({!hdrAhead()}?=> .)+;

If you run the demo above:

java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser

(On Windows, the last command is: java -cp .;antlr-3.3.jar TParser)

the following is printed to the console:

parsed :: HDR1foo
parsed :: HDR2barbaz
parsed :: HDR3
parsed :: HDR4theend...

for the input string:

HDR1 foo HDR2 bar 

baz HDR3HDR4 the end...
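
To see which tokens the predicate produces, the lexer's loop can be mimicked in plain Java. This is only a sketch of the same look-ahead idea, not the code ANTLR generates: `LookaheadDemo` and `tokenize` are made-up names, and unlike a real lexer it simply throws when HDR is not followed by a digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer; mirrors the HEADER and DATA rules of grammar T.
class LookaheadDemo {
    // Same check as hdrAhead() in the grammar, applied to a plain String.
    static boolean hdrAhead(String s, int pos) {
        return s.startsWith("HDR", pos);
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns tokens formatted as "TYPE:text".
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int start = pos;
            if (hdrAhead(s, pos)) {
                // HEADER : 'HDR' '0'..'9'+
                pos += 3;
                if (pos >= s.length() || !isDigit(s.charAt(pos))) {
                    throw new IllegalArgumentException("expected a digit at index " + pos);
                }
                while (pos < s.length() && isDigit(s.charAt(pos))) {
                    pos++;
                }
                tokens.add("HEADER:" + s.substring(start, pos));
            } else {
                // DATA : ({!hdrAhead()}?=> .)+
                // Consume one character at a time, but only while no "HDR" lies ahead.
                while (pos < s.length() && !hdrAhead(s, pos)) {
                    pos++;
                }
                tokens.add("DATA:" + s.substring(start, pos));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
        for (String token : tokenize(input)) {
            System.out.println(token.replaceAll("\\s+", " "));
        }
    }
}
```

For the demo input this gives seven tokens: four HEADER tokens and three DATA tokens. HDR3 is followed directly by HDR4 and so has no DATA token, which is why the packet rule uses DATA? rather than DATA.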
Answered 2011-12-18T19:01:30.637