python - How would you parse indentation (Python style)?

Question

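How would you define your parser and lexer rules to parse a language that uses indentation to define scope?

I have already googled this and found a clever approach: generating INDENT and DEDENT tokens in the lexer.

I will dig deeper into the problem and post an answer if I come across anything interesting, but I would like to see other approaches to it.

EDIT: As Charlie pointed out, there is already another thread that is very similar, if not the same. Should my post be deleted?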

score 11 · Accepted Answer

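This is somewhat hypothetical, since it depends on what technology you have for your lexer and parser, but the easiest way seems to be to use BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. To apply the "off-side rule", your lexer needs to keep track of a stack of indentation levels. When the indentation level increases, emit a BEGINBLOCK for the parser; when the indentation level decreases, emit an ENDBLOCK and pop levels off the stack.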
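The stack-based idea above can be sketched in a few lines of Python. This is only a minimal illustration, not tied to any particular lexer generator: the function name `tokenize`, the `LINE` token kind, and the choice to count only leading spaces (no tabs) are assumptions made for the example.

```python
def tokenize(source):
    """Yield (kind, value) tokens, adding BEGINBLOCK/ENDBLOCK on indent changes.

    Hypothetical sketch: each non-blank line becomes one LINE token, and
    indentation is measured in leading spaces only.
    """
    indents = [0]  # stack of currently open indentation widths
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:  # blank lines do not affect block structure
            continue
        width = len(line) - len(stripped)
        if width > indents[-1]:
            # Deeper than the enclosing block: open a new one.
            indents.append(width)
            yield ("BEGINBLOCK", None)
        else:
            # Shallower: close blocks until we are back at a known level.
            while width < indents[-1]:
                indents.pop()
                yield ("ENDBLOCK", None)
            if width != indents[-1]:
                raise IndentationError(f"inconsistent dedent: {line!r}")
        yield ("LINE", stripped)
    # Close any blocks still open at end of input.
    while len(indents) > 1:
        indents.pop()
        yield ("ENDBLOCK", None)


if __name__ == "__main__":
    for token in tokenize("a\n  b\n    c\nd\n"):
        print(token)
```

Note that a single dedent can close several blocks at once, which is why ENDBLOCK is emitted in a loop while the stack is popped.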

Here is another discussion of this on SO, by the way.

score 1 · Accepted Answer

You can also track in the lexer how many indent items precede the first token of each line and pass that count to the parser. The interesting part is passing it to the parser correctly :) If your parser uses lookahead (by which I mean the parser may request a variable number of tokens before it actually matches even one), then passing the count through a single global variable seems like a very bad idea, because the lexer can slip onto the next line and change the indent counter while the parser is still working on the previous line. Globals are evil in many other cases too ;) Marking the first 'real' token of each line with the indent counter in some way is more reasonable.

I can't give you an exact example (I don't even know which parser and lexer generators you are going to use, if any...), but something like storing the data on the first token of each line (which could be uncomfortable if you can't easily get at that token from the parser) or saving custom data on the side (a map that links tokens to indents, or an array with the source line number as the index and the indent as the value) seems to be enough. One downside of this approach is the additional complexity in the parser, which needs to distinguish between indent values and change its behavior based on them. Something like LOOKAHEAD({ yourConditionInJava }) for JavaCC may work here, but it is NOT a very good idea. The many additional tokens in your approach seem to be the lesser evil :)
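To make the "mark the first token of the line" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `Token` class, the `tokenize_with_indent` name, and the simplification of splitting tokens on whitespace. The point is only that the indent travels with the token itself, so lexer lookahead cannot overwrite it the way it could overwrite a global counter.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Token:
    text: str
    line: int
    # Indent width in spaces; set only on the first token of a line.
    indent: Optional[int] = None


def tokenize_with_indent(source: str) -> Iterator[Token]:
    """Split each line on whitespace and tag its first token with the indent."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" ")
        if not stripped:  # skip blank lines
            continue
        width = len(line) - len(stripped)
        for position, word in enumerate(stripped.split()):
            yield Token(word, lineno, width if position == 0 else None)


if __name__ == "__main__":
    for token in tokenize_with_indent("if x\n  y = 1\n"):
        print(token)
```

The parser then has to inspect `indent` on each line-initial token and decide for itself whether a block starts or ends, which is the extra parser complexity mentioned above.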

As another alternative, I would suggest mixing these two approaches. You could generate additional tokens only when the indent counter changes its value on the next line, like artificial BEGIN and END tokens. This way you reduce the number of 'artificial' tokens in the stream fed from the lexer to the parser. Only your parser grammar needs to be adjusted to understand the additional tokens...
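On the parser side, understanding the extra tokens can be quite small. The following is a hedged sketch of a recursive-descent routine that turns a stream containing artificial BEGIN and END tokens into a nested tree. The `(kind, value)` token shape, the `LINE` kind, and the `parse_block` name are all hypothetical and chosen just for this example.

```python
def parse_block(tokens, pos=0):
    """Parse tokens[pos:] into a list of nodes; return (nodes, next_pos).

    Assumes tokens are (kind, value) pairs where kind is "LINE", "BEGIN",
    or "END". A BEGIN attaches the following block to the preceding line.
    """
    nodes = []
    while pos < len(tokens):
        kind, value = tokens[pos]
        if kind == "BEGIN":
            if not nodes:
                raise SyntaxError("block without a header line")
            # Recurse into the nested block and attach it to its header.
            children, pos = parse_block(tokens, pos + 1)
            nodes[-1]["children"] = children
        elif kind == "END":
            # Current block is finished; resume in the caller after END.
            return nodes, pos + 1
        else:  # an ordinary LINE token
            nodes.append({"line": value, "children": []})
            pos += 1
    return nodes, pos


if __name__ == "__main__":
    stream = [
        ("LINE", "if x"),
        ("BEGIN", None),
        ("LINE", "y = 1"),
        ("END", None),
        ("LINE", "z"),
    ]
    tree, _ = parse_block(stream)
    print(tree)
```

With a parser generator, the same thing would be a grammar rule that treats BEGIN and END like opening and closing braces.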

I haven't tried this (I have no real experience parsing such languages); I'm just sharing my thoughts about possible solutions. Checking existing parsers for these kinds of languages could be of great value to you. Open source is your friend ;)
