lua - Parsing a TeX-like language with lpeg

Question

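I am struggling to get my head around LPEG. I have managed to produce one grammar that does what I want, but I have been beating my head against this one without getting far. The idea is to parse a document written in a simplified form of TeX. I want to split the document into:

Environments, which are \begin{cmd} and \end{cmd} pairs.
Commands, which can either take an argument like so: \foo{bar}, or be bare: \foo.
Both environments and commands can have parameters like so: \command[color=green,background=blue]{content}.
Other stuff.

I would also like to keep track of line number information for error handling. Here is what I have so far: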

lpeg = require("lpeg")
lpeg.locale(lpeg)
-- Assume a lot of "X = lpeg.X" here.

-- Line number handling from http://lua-users.org/lists/lua-l/2011-05/msg00607.html
-- with additional print statements to check they are working.
local newline = P"\r"^-1 * "\n" / function (a) print("New"); end
local incrementline = Cg( Cb"linenum" )/ function ( a ) print("NL");  return a + 1 end , "linenum"
local setup = Cg ( Cc ( 1) , "linenum" )
nl = newline * incrementline
space = nl + lpeg.space

-- Taken from "Name-value lists" in http://www.inf.puc-rio.br/~roberto/lpeg/
local identifier = (R("AZ") + R("az") + P("_") + R("09"))^1
local sep = lpeg.S(",;") * space^0
local value = (1-lpeg.S(",;]"))^1
local pair = lpeg.Cg(C(identifier) * space ^0 * "=" * space ^0 * C(value)) * sep^-1
local list = lpeg.Cf(lpeg.Ct("") * pair^0, rawset)
local parameters = (P("[") * list * P("]")) ^-1

-- And the rest is mine

anything = C( (space^1 + (1-lpeg.S("\\{}")) )^1) * Cb("linenum") / function (a,b) return { text = a, line = b } end

begin_environment = P("\\begin") * Ct(parameters) * P("{") * Cg(identifier, "environment") * Cb("environment") * P("}") / function (a,b) return { params = a[1], environment = b } end
end_environment = P("\\end{") * Cg(identifier) * P("}") 

texlike = lpeg.P{
  "document";
  document = setup * V("stuff") * -1,
  stuff = Cg(V"environment" + anything + V"bracketed_stuff" + V"command_with" + V"command_without")^0,
  bracketed_stuff = P"{" * V"stuff" * P"}" / function (a) return a end,
  command_with =((P("\\") * Cg(identifier) * Ct(parameters) * Ct(V"bracketed_stuff"))-P("\\end{")) / function (i,p,n) return { command = i, parameters = p, nodes = n } end,
  command_without = (( P("\\") * Cg(identifier) * Ct(parameters) )-P("\\end{")) / function (i,p) return { command = i, parameters = p } end,
  environment = Cg(begin_environment * Ct(V("stuff")) * end_environment) / function (b,stuff, e) return { b = b, stuff = stuff, e = e} end
}

It almost works!

> texlike:match("\\foo[one=two]thing\\bar")
{
  command = "foo",
  parameters = {
    {
      one = "two",
    },
  },
}
{
  line = 1,
  text = "thing",
}
{
  command = "bar",
  parameters = {
  },
}

But! First, I can't get the line number handling to work at all. The function inside incrementline is never fired.

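I also can't quite work out how nested capture information is passed to the handling function (which is why I have scattered Cg, C and Ct semi-randomly over the grammar). As a result, only one item is returned from within a command_with: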

> texlike:match("\\foo{text \\command moretext}")
{
  command = "foo",
  nodes = {
    {
      line = 1,
      text = "text ",
    },
  },
  parameters = {
  },
}

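I would also love to be able to check that the environment start and end match up, but when I tried, the back references from "begin" were no longer in scope by the time I got to "end". I don't know where to go from here.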

score 6 · Accepted Answer

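Late answer, but hopefully it offers some insight if you are still looking for a solution or wondering what the problem was.

There are a couple of issues with your grammar, some of which can be tricky to spot.

Your line increment here looks incorrect: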

local incrementline = Cg( Cb"linenum" ) / 
                      function ( a ) print("NL");  return a + 1 end, 
                      "linenum"

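It looks like you meant to create a named capture group, not an anonymous one. The back capture linenum is essentially being used like a variable. The problem is that the name sits outside the Cg call: the trailing "linenum" is just an extra expression that the local assignment discards, so the group stays anonymous and linenum never updates properly. function(a) will always receive 1 when called. You need to move the closing ) to the end so that "linenum" is included: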

local incrementline = Cg( Cb"linenum" / 
                      function ( a ) print("NL");  return a + 1 end, 
                      "linenum")

See the relevant LPeg documentation for the Cg capture.
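To see the corrected named group behave like a counter, here is a minimal sketch in isolation. The `count_lines` driver pattern and the sample input are assumptions made up for illustration, not part of the original grammar, and the print statements are dropped.

```lua
local lpeg = require("lpeg")
local P, Cg, Cb, Cc = lpeg.P, lpeg.Cg, lpeg.Cb, lpeg.Cc

-- Seed the named group "linenum" with 1.
local setup = Cg(Cc(1), "linenum")

-- The name is now inside the Cg call, so each use rebinds "linenum"
-- to the previous value plus one.
local incrementline = Cg(Cb("linenum") / function(n) return n + 1 end, "linenum")

local newline = P("\r")^-1 * P("\n")

-- Hypothetical driver: walk over the input, bumping the counter at every
-- newline, then read the counter back with a final back capture.
local count_lines = setup
                  * (newline * incrementline + (1 - newline))^0
                  * Cb("linenum")

local last_line = count_lines:match("a\nb\r\nc")
print(last_line)
```

With two newlines in the input, the final back capture should report line 3. Note that each back capture re-evaluates the group it refers to, so this style can get slow on long inputs.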

The second problem is with your anything non-terminal rule:

anything = C( (space^1 + (1-lpeg.S("\\{}")) )^1) * Cb("linenum") ...

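There are a few things to be careful about here. First, a named Cg capture (from the incrementline rule, once it is fixed) doesn't produce anything unless it is inside a table capture or you back-reference it. The second, more important point is that it has an ad hoc scope, like a variable. More precisely, its scope ends once it is closed inside an outer capture, which is what happens here: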

C( (space^1 + (...) )^1)

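This means that by the time you reference it with the back capture * Cb("linenum"), it is already too late: the linenum you actually want has already gone out of scope.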
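The scoping rule can be checked with a small experiment. This is only a sketch: the group name "tag" and both patterns are hypothetical and exist just to contrast a visible group with one that was closed inside an outer capture.

```lua
local lpeg = require("lpeg")
local P, C, Cg, Cb, Cc = lpeg.P, lpeg.C, lpeg.Cg, lpeg.Cb, lpeg.Cc

-- Group and back capture are siblings, so the group is still visible.
local visible = Cg(Cc("x"), "tag") * C(P("a")) * Cb("tag")
local text, tag = visible:match("a")

-- Here the group is closed inside the outer C(...), so the later Cb
-- has nothing to refer to and LPeg raises an error during matching.
local hidden = C(Cg(Cc("x"), "tag") * P("a")) * Cb("tag")
local ok, err = pcall(lpeg.match, hidden, "a")

print(text, tag, ok)
```

The first match should yield "a" and "x", while the second fails inside pcall with a "back reference not found" style error.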

I have always found LPeg's re syntax a bit easier to grok, so I have rewritten the grammar with that instead:

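local re = require("re")

-- The callbacks referenced here are shown further down; in a real file
-- they must be defined before this table is built.
local grammar_cb =
{
  fold = pairfold, 
  resetlinenum = resetlinenum,
  incrementlinenum = incrementlinenum, getlinenum = getlinenum, 
  error = error
}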

local texlike_grammar = re.compile(
[[
  document    <- '' -> resetlinenum {| docpiece* |} !.
  docpiece    <- {| envcmd |} / {| cmd |} / multiline
  beginslash  <- cmdslash 'begin'
  endslash    <- cmdslash 'end'
  envcmd      <- beginslash paramblock? {:beginenv: envblock :} (!endslash docpiece)*
                 endslash openbrace {:endenv: =beginenv :} closebrace / &beginslash {} -> error .
  envblock    <- openbrace key closebrace
  cmd         <- cmdslash {:command: identifier :} (paramblock? cmdblock)?
  cmdblock    <- openbrace {:nodes: {| docpiece* |} :} closebrace
  paramblock  <- opensq ( {:parameters: {| parampairs |} -> fold :} / whitesp) closesq
  parampairs  <- parampair (sep parampair)*
  parampair   <- key assign value
  key         <- whitesp { identifier }
  value       <- whitesp { [^],;%s]+ }
  multiline   <- (nl? text)+
  text        <- {| {:text: (!cmd !closebrace !%nl [_%w%p%s])+ :} {:line: '' -> getlinenum :} |}
  identifier  <- [_%w]+
  cmdslash    <- whitesp '\'
  assign      <- whitesp '='
  sep         <- whitesp ','
  openbrace   <- whitesp '{'
  closebrace  <- whitesp '}'
  opensq      <- whitesp '['
  closesq     <- whitesp ']'
  nl          <- {%nl+} -> incrementlinenum
  whitesp     <- (nl / %s)*
]], grammar_cb)

The callback functions are straightforwardly defined as:

local function pairfold(...)
  local t, kv = {}, ...
  if #kv % 2 == 1 then return ... end
  for i = #kv, 2, -2 do
    t[ kv[i - 1] ] = kv[i]
  end
  return t
end

local incrementlinenum, getlinenum, resetlinenum do
  local line = 1
  function incrementlinenum(nl)
    assert(not nl:match "%S")
    line = line + #nl
  end

  function getlinenum() return line end
  function resetlinenum() line = 1 end
end
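As a quick illustration of what the fold callback does, the sketch below calls pairfold by hand. The function body is repeated from the answer so the snippet runs on its own; the input table is a made-up stand-in for what the table capture around parampairs would collect from "[color = red, background = black]".

```lua
-- pairfold as defined in the answer, repeated so this snippet is standalone.
local function pairfold(...)
  local t, kv = {}, ...
  -- An odd number of entries cannot be key/value pairs: pass through as-is.
  if #kv % 2 == 1 then return ... end
  for i = #kv, 2, -2 do
    t[kv[i - 1]] = kv[i]
  end
  return t
end

-- Hypothetical flat capture list: key, value, key, value.
local folded = pairfold({ "color", "red", "background", "black" })
print(folded.color, folded.background)

-- An odd-length list is returned unchanged.
local untouched = pairfold({ "orphan" })
print(untouched[1])
```

Folding the flat list into a keyed table is what makes the parameters field in the resulting AST directly indexable by name.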

Testing the grammar with a nontrivial TeX-like string spanning multiple lines:

  local test1 = [[\foo{text \bar[color = red, background =   black]{
  moretext \baz{
even 
more text} }


this time skipping multiple

lines even, such wow!}]]

This produces the following AST, shown in Lua table format:

{
  command = "foo",
  nodes = {
    {
      text = "text",
      line = 1
    },
    {
      parameters = {
        color = "red",
        background = "black"
      },
      command = "bar",
      nodes = {
        {
          text = "  moretext",
          line = 2
        },
        {
          command = "baz",
          nodes = {
            {
              text = "even ",
              line = 3
            },
            {
              text = "more text",
              line = 4
            }
          }
        }
      }
    },
    {
      text = "this time skipping multiple",
      line = 7
    },
    {
      text = "lines even, such wow!",
      line = 9
    }
  }
}
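The line values in this output come from ordinary Lua callbacks that run in order as the captures are evaluated. Here is a reduced sketch of that mechanism on its own; the `words` grammar, its rule names and the callback names are hypothetical and much simpler than the grammar in the answer.

```lua
local re = require("re")

-- Shared counter, updated by the callbacks below.
local line = 1
local callbacks = {
  reset   = function() line = 1 end,
  bump    = function(nl) line = line + #nl end,
  getline = function() return line end,
}

-- Every word is recorded together with the line it was found on.
local words = re.compile([[
  doc  <- '' -> reset {| (nl / word / %s)* |} !.
  word <- {| {:text: %a+ :} {:line: '' -> getline :} |}
  nl   <- {%nl+} -> bump
]], callbacks)

local result = words:match("one\ntwo\n\nthree")
for _, w in ipairs(result) do
  print(w.line, w.text)
end
```

Because the counter lives in a Lua upvalue rather than in a capture group, there is no capture scoping to worry about. The trade-off is that the grammar is no longer purely declarative and relies on the callbacks firing in source order.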

A second test, for begin/end environments:

  local test2 = [[\begin[p1
=apple,
p2=blue]{scope} scope foobar   
\end{scope} global foobar]]

This seems to give roughly what you are looking for:

{
  {
    {
      text = " scope foobar",
      line = 3
    },
    parameters = {
      p1 = "apple",
      p2 = "blue"
    },
    beginenv = "scope",
    endenv = "scope"
  },
  {
    text = " global foobar",
    line = 4
  }
}
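The begin/end check relies on re's `=name` back reference, which only matches the text previously captured by the named group. A stripped-down sketch of just that part follows, with hypothetical rule names, no parameters or nesting, and a plain failed match instead of the error callback used in the full grammar.

```lua
local re = require("re")

-- The closing tag must repeat the name captured by the opening tag.
local env = re.compile([[
  env   <- begin { (!close .)* } close !.
  begin <- '\begin{' {:name: %a+ :} '}'
  close <- '\end{' =name '}'
]])

local body     = env:match([[\begin{scope} body \end{scope}]])
local mismatch = env:match([[\begin{scope} body \end{other}]])
print(body, mismatch)
```

The first input should return the environment body, and the second should return nil because \end{other} does not repeat the captured name.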
