string - ANTLR4: pass text from the lexer back to the parser as strings rather than individual characters

Question

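I have a grammar that needs to handle comments that start with '{*' and end with '*}' at any point in the input stream. It also needs to handle template markers, which start with '{' followed by a '$' or an identifier and end with '}', and pass everything else through as text.

The only way to achieve this seems to be to pass anything that is not a comment or a marker back to the parser as individual characters and let the parser build the string. This is very inefficient: the parser has to build a node for every character it receives, and I then have to walk the nodes and build a string from them. It would be much simpler and faster if the lexer could return the text as one large string.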

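On an i7, running the program as a 32-bit C# program on a 90K text file with no markers or comments, just text, it takes about 15 minutes before it crashes with a memory exception.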

The grammar is basically:

Parser:
text: ANY_CHAR+;

Lexer:

COMMENT: '{*' .*? '*}' -> skip;

... Token Definitions .....

ANY_CHAR: [ -~];

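If I try to accumulate the text in the lexer, it swallows everything and never recognizes comments or markers, because something like ANY_CHAR+ matches everything and returns the comments and template markers as part of the string.

Does anyone know a way around this? At the moment it looks like I will have to hand-write a lexer.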

score 0 · Accepted Answer

Yes, that is inefficient, but it is also not the way to do it. The solution lies entirely in the lexer.

As I understand it, you want to detect comments, template markers and text. For this, you should use lexer modes. Every time you hit '{', go into some lexer mode, say MODE1, where you can detect only '*' or '$' or something else (I didn't quite understand what you meant by '{' followed by a '$' or an identifier), and depending on what you hit, go into MODE2 or MODE3. After that (in MODE2 or MODE3), wait for '}' and switch back to the default mode. Of course, you could add even more modes in between, depending on what you want to do, but for what I've just described:

MODE1 is where you determine whether you are now detecting a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*', go to MODE2; if it's anything else, go to MODE3.
MODE2 needs only one token, COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
MODE3 is similar to MODE2: detect what you need and have a token that switches back to the default mode.
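
As a rough sketch of the idea above (not a tested grammar for your full language), the lexer below folds MODE1 into the default mode by matching '{*' and '{' directly, so only two extra modes are needed. All rule and mode names here (TemplateLexer, TEXT, TAG_OPEN, COMMENT_MODE, TAG_MODE and so on) are made up for illustration. It also assumes that every '{' that does not start a comment opens a template marker, and that identifiers look like [a-zA-Z_][a-zA-Z_0-9]*.

```antlr
lexer grammar TemplateLexer;

// Default mode: plain text plus the two openers.
// '{*' is longer than '{', so the longest-match rule picks COMMENT_OPEN.
COMMENT_OPEN : '{*' -> skip, pushMode(COMMENT_MODE) ;
TAG_OPEN     : '{'  -> pushMode(TAG_MODE) ;
// One token per run of text (anything up to the next '{'),
// instead of one token per character.
TEXT         : ~[{]+ ;

// Inside a comment: discard everything until '*}'.
mode COMMENT_MODE;
COMMENT_CLOSE : '*}' -> skip, popMode ;
COMMENT_BODY  : .    -> skip ;

// Inside a template marker (assumed contents: '$', identifiers, whitespace).
// Any other character here is reported as a token recognition error.
mode TAG_MODE;
TAG_CLOSE : '}' -> popMode ;
DOLLAR    : '$' ;
ID        : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS        : [ \t\r\n]+ -> skip ;
```

Because TEXT matches a whole run of non-'{' characters, the parser rule from the question could then be written over TEXT tokens (for example text: TEXT+;) rather than over one ANY_CHAR per character. Modes are only allowed in a lexer grammar, so this would live in its own file, with the parser grammar picking up the tokens through options { tokenVocab = TemplateLexer; }.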
