lark-parser - node depth encoded as number of stars

Question

A document in this language looks like this:

* A top-level Headline 

  Some text about that headline.

** Sub-Topic 1

Text about the sub-topic 1.

*** Sub-sub-topic

 More text here about the sub-sub-topic

** Sub-Topic 2

   Extra text here about sub-topic 2

*** Other Sub-sub-topic

 More text here about the other sub-sub-topic

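The number of depth levels is unlimited. I would like to know how to get a parser that builds the nested tree properly. I have been looking at the indenter example for inspiration, but I have not figured it out yet.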

score 1 · Accepted Answer

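This problem requires a context-sensitive grammar, so we use the workaround from the indentation example you linked:

We write a custom postlex processor that keeps a stack of the observed indentation levels. When a run of stars (*, **, ***, ...) is read, the stack is popped until the indentation level on top of the stack is smaller, and then the new level is pushed onto the stack. For each push/pop, the corresponding INDENT/DEDENT helper tokens are injected into the token stream. These helper tokens can then be used in the grammar to obtain a parse tree that reflects the nesting level.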
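Before the full Lark version, the stack bookkeeping can be looked at in isolation. The following is only a minimal sketch that works on plain star counts instead of Lark tokens; the function name `star_events` and the event labels are made up for illustration and are not part of Lark.

```python
def star_events(levels):
    """Yield ('STARS', n), 'INDENT' and 'DEDENT' events for a sequence of star counts.

    Hypothetical helper for illustration; it mirrors the push/pop rule
    described above without depending on Lark.
    """
    stack = [0]  # observed nesting levels; 0 stands for the document root
    for level in levels:
        # Pop every open level that is not shallower than the new one.
        while level <= stack[-1]:
            popped = stack.pop()
            for _ in range(popped - stack[-1]):
                yield "DEDENT"
        diff = level - stack[-1]
        stack.append(level)
        yield ("STARS", level)
        # One INDENT per level gained, so skipped levels stay balanced.
        for _ in range(diff):
            yield "INDENT"
    # Close whatever is still open at the end of the input.
    while len(stack) > 1:
        popped = stack.pop()
        for _ in range(popped - stack[-1]):
            yield "DEDENT"


# Star counts for the headings *, **, ***, **, ***
events = list(star_events([1, 2, 3, 2, 3]))
print(events)
```

Every INDENT is matched by exactly one DEDENT, which is what lets a context-free grammar rule bracket each nested section.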

from lark import Lark, Token

tree_grammar = r"""
    start: NEWLINE* item*
    item: STARS nest
    nest: _INDENT (nest | LINE+ item*) _DEDENT
    STARS.2: /\*+/
    LINE.1: /.*/ NEWLINE
    %declare _INDENT _DEDENT
    %import common.NEWLINE
"""

class StarIndenter():
  STARS_type = 'STARS'
  INDENT_type = '_INDENT'
  DEDENT_type = '_DEDENT'

  def dedent(self, level, token):
    """ When the given level leaves the current nesting of the stack,
        inject corresponding number of DEDENT tokens into the stream.
    """
    while level <= self.indent[-1]:
      pop_level = self.indent.pop()
      pop_diff = pop_level - self.indent[-1]
      for _ in range(pop_diff):
        yield token

  def handle_stars(self, token):
    """ Handle tokens of the form '*', '**', '***', ...
    """

    level = len(token.value)

    dedent_token = Token.new_borrow_pos(self.DEDENT_type, '', token)
    yield from self.dedent(level, dedent_token)

    diff = level-self.indent[-1]
    self.indent.append(level)

    # Put star token into stream
    yield token

    indent_token = Token.new_borrow_pos(self.INDENT_type, '', token)
    for _ in range(diff):
      yield indent_token

  def process(self, stream):
    self.indent = [0]

    # Process token stream
    for token in stream:
      if token.type == self.STARS_type:
        yield from self.handle_stars(token)
      else:
        yield token

    # Inject closing dedent tokens
    yield from self.dedent(1, Token(self.DEDENT_type, ''))

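  # Lark reads `always_accept` from the postlex object: extra terminal types
  # the contextual lexer should always accept. None are needed here.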
  @property
  def always_accept(self):
    return ()

parser = Lark(tree_grammar, parser='lalr', postlex=StarIndenter())

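Note that the STARS terminal has been assigned a higher priority than LINE (via .2 vs. .1), to prevent LINE+ from consuming lines that start with a star.

Using a trimmed-down version of your example: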
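Why the priority matters can be illustrated with Python's `re` module alone. This is a rough analogy and not Lark's actual lexer code: it assumes that higher-priority terminals are tried first, which is modelled here by their order in a regex alternation. The helper `first_token` and the simplified patterns (NEWLINE reduced to `\n`) are assumptions made for the sketch.

```python
import re

# Simplified stand-ins for the grammar's terminals (assumption: NEWLINE is "\n").
STARS = r"\*+"
LINE = r".*\n"


def first_token(text, ordered_terminals):
    """Return (terminal name, matched text) for the first alternative matching at position 0."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in ordered_terminals)
    match = re.compile(pattern).match(text)
    return match.lastgroup, match.group()


line = "** AA\n"

# STARS tried first: only the stars are consumed, the rest is left for LINE.
stars_first = first_token(line, [("STARS", STARS), ("LINE", LINE)])
print(stars_first)  # ('STARS', '**')

# LINE tried first: it swallows the whole heading, stars included.
line_first = first_token(line, [("LINE", LINE), ("STARS", STARS)])
print(line_first)  # ('LINE', '** AA\n')
```

Because `.*` also matches text that begins with stars, both terminals are candidates at the start of a heading line, and the order in which they are tried decides which one wins.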

test_tree = """
* A
** AA
*** AAA
** AB
*** ABA
"""

print(parser.parse(test_tree).pretty())

The result is:

start
  

  item
    *
    nest
       A

      item
        **
        nest
           AA

          item
            ***
            nest     AAA

      item
        **
        nest
           AB

          item
            ***
            nest     ABA
