parsing - 用于 Python 样式缩进的 PEG

Question

您将如何在以下任何可以处理 Python/Haskell/CoffeScript 样式缩进的解析器生成器（PEG.js、Citrus、Treetop ）中编写解析表达式语法：

尚不存在的编程语言的示例：

square x =
    x * x

cube x =
    x * square x

fib n =
  if n <= 1
    0
  else
    fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets

更新： 不要尝试为上面的示例编写解释器。我只对缩进问题感兴趣。另一个示例可能是解析以下内容：

foo
  bar = 1
  baz = 2
tap
  zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }

score 24 · Accepted Answer

纯 PEG 无法解析缩进。

但是peg.js可以。

我做了一个快速而肮脏的实验（受到 Ira Baxter 关于作弊的评论的启发）并编写了一个简单的标记器。

有关更完整的解决方案（完整的解析器），请参阅以下问题：Parse indentation level with PEG.js

/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0])
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;

    if (depth == depths[0]) return [];

    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }

    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }

    if (depth != depths[0]) dents.push("BADDENT");

    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text                      { return [depth, s] }
indent  = s:" "*                                   { return indent(s) }
text    = c:[^\n]*                                 { return c.join("") }
newline = "\n"                                     {}

depths是一堆缩进。indent() 返回一个缩进标记数组， start() 解包该数组以使解析器的行为有点像流。

peg.js为文本生成：

alpha
  beta
  gamma
    delta
epsilon
    zeta
  eta
theta
  iota

这些结果：

[
   "alpha",
   "INDENT",
   "beta",
   "gamma",
   "INDENT",
   "delta",
   "DEDENT",
   "DEDENT",
   "epsilon",
   "INDENT",
   "zeta",
   "DEDENT",
   "BADDENT",
   "eta",
   "theta",
   "INDENT",
   "iota",
   "DEDENT",
   "",
   ""
]

这个分词器甚至可以捕捉到不好的缩进。

score 8 · Accepted Answer

我认为像这样的缩进敏感语言是上下文敏感的。我相信 PEG 只能做上下文无关的语言。

请注意，虽然 nalply 的回答肯定是正确的，PEG.js 可以通过外部状态（即可怕的全局变量）来做到这一点，但它可能是一条危险的路径（比全局变量的常见问题更糟糕）。一些规则最初可以匹配（然后运行它们的操作），但父规则可能会失败，从而导致操作运行无效。如果在此类操作中更改了外部状态，则最终可能会出现无效状态。这非常可怕，可能导致震颤、呕吐和死亡。此处的评论中有一些问题和解决方案：https ://github.com/dmajda/pegjs/issues/45

score 8 · Accepted Answer

所以我们在这里真正用缩进做的是创建类似 C 风格的块，它通常有自己的词法范围。如果我正在为这样的语言编写编译器，我想我会尝试让词法分析器跟踪缩进。每次缩进增加时，它都可以插入一个“{”标记。同样，每次它减少时，它都可以插入一个“}”标记。然后用显式花括号编写表达式语法来表示词法范围变得更加直接。

score 2 · Accepted Answer

您可以通过使用语义谓词在 Treetop 中执行此操作。在这种情况下，您需要一个语义谓词来检测由于出现具有相同或更少缩进的另一行而关闭空白缩进块。谓词必须从开始行计算缩进，如果当前行的缩进以相同或更短的长度结束，则返回 true（块关闭）。因为结束条件是依赖于上下文的，所以它不能被记忆。这是我将要添加到 Treetop 文档中的示例代码。请注意，我已经覆盖了 Treetop 的 SyntaxNode 检查方法，以便更轻松地可视化结果。

grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

 rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
            "\n{\n" +
            nested_blocks.elements.map { |content|
              content.block.inspect+"\n"
            }*'' +
            "}"
          )
          : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end

这是一个小测试驱动程序，因此您可以轻松尝试：

require 'polyglot'
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
      and some more like that
    before going back to here
      down again
  back twice
and start from the beginning again
  with only a small block this time
END 

parse_tree = parser.parse input

p parse_tree

score 0 · Accepted Answer

我知道这是一个旧线程，但我只是想在答案中添加一些 PEGjs 代码。此代码将解析一段文本并将其“嵌套”成一种“AST-ish”结构。它只深入了一层，看起来很丑，而且它并没有真正使用返回值来创建正确的结构，而是保留了一个内存中的语法树，它会在最后返回它。这可能会变得笨拙并导致一些性能问题，但至少它做了它应该做的事情。

注意：确保你有制表符而不是空格！

{ 
    var indentStack = [], 
        rootScope = { 
            value: "PROGRAM",
            values: [], 
            scopes: [] 
        };

    function addToRootScope(text) {
        // Here we wiggle with the form and append the new
        // scope to the rootScope.

        if (!text) return;

        if (indentStack.length === 0) {
            rootScope.scopes.unshift({
                text: text,
                statements: []
            });
        }
        else {
            rootScope.scopes[0].statements.push(text);
        }
    }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    { 
        return rootScope;
    }


line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
        console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
        indentStack.push(""); 
        console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
    = i:[\t]* &{ return i.length < indentStack.length; }
      {
          for (var j = 0; j < i.length + 1; j++) {
              indentStack.pop();
          } 
          console.log("d:", i.length + 1, " level:", indentStack.length);  
      }

text
    = numbers: number+  { return numbers.join(""); } 
    / txt: character+   { return txt.join(""); }


number
    = $[0-9] 

character 
    = $[ a-zA-Z->+]  
__
    = [ ]+

_ 
    = [ ]*

EOF 
    = !.

EOL
    = "\r\n" 
    / "\n" 
    / "\r"

parsing - 用于 Python 样式缩进的 PEG

5 回答 5

Related

Reference