php - Parsing balanced nested wiki templates and extracting the content of a single-line parameter with regular expressions

Question

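I know that parsing nested strings or HTML is better done by a real parser, but in my case I have simple templates and wanted to extract the content of the wiki parameter "title" from each template. It took me a while to get there, but thanks to the regexp tool by Lars Olav Torvik ( http://regex.larsolavtorvik.com/ ) and this user forum I managed it. Maybe someone will find it useful. (We all want to contribute, eh, don't we? ;-) ) The commented code below does the trick. I had to use lookaround assertions so that two templates do not get mixed together when one of them has no title.

I am still unsure about two things in the regexp comments (look for (?# Question: …)). First, have I understood (?R) correctly? Does it take what it checks from the outermost defined level, i.e. from the second regexp line with \{\{ to the last regexp line with \}\}? Second, what is the difference between ++ and + before the (?R) alternative? Both seem to work when testing.

The original wiki templates on the page (simplest case):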

$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}

{{Templ2
| any parameter = something {{template}}
}}

{{Templ1
| title = (3. template) title
}}
";

The replacement:

$wikiTemplate = preg_replace(
  array(
  // tag all templates with START … END and add a TITLE-placeholder before
  // and take care of balanced {{ …  }} recursiveness 
    "@(?s)   (?# switch to dotall match, i.e. also linebreaks )
      \{\{ (?# find two {{ )
      (?: (?# group 1 as a non-backreferenced match  )
        (?:  (?# group 2 as a non-backreferenced match  )
          (?! (?# in group 1 anything but not {{ or }} )
            \{\{ 
            |   (?# or )
            \}\}
          )
          .
        )++  (?# Question: what is the difference between ++ and + here? )
        |    (?# or )
        (?R) (?# is it recursive of what is defined in the outermost,
              i.e. 2nd regexp line with \{\{ and last line with \}\}
              Question: is that here understood correctly? ) 
      )
      * (?# zero or many times of the inner regexp definitions )
      \}\} (?# find two }} )
    @x",// x-extended → ignore white space in the pattern
  // replace TITLE by single line content of title parameter 
    "@
      (?<=TITLE) (?# TITLE must precede the following linebreak but is not
                  backreferenced within \\0, i.e. the whole returned match)
      ([\n\r]+)  (?#linebr in 1 may also be described as . because of
                  s-modifier dotall)
      (?:        (?# start non-backreferenced match )
        .        (?# any character but not followed by START)
        (?!START)
      )+      (?# multiple times)
      (?:     (?# start non-backreferenced match )
        \|\s*title\s*=\s* (?#find the parameter '| title = ')
      )
      ([^\r\n]+)  (?#get title now to \\2 but exclude the line break. 
                   Note it is buggy when there is no line break )
      (?:     (?# start non-backreferenced match )
        .     (?# any character but not followed by END)
        (?!END)
      )
      +       (?# multiple times)
      .       (?# any single character, e.g. the last  because as all
               stuff before captures anything not followed by END)
      (?:END) (?#a not backreferenced END)
    @msx", // m-multiline, s-dotall match also linebreaks,
           // x-extended → ignore white space in the pattern
  ), 
  array(
    "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
  # replace the TITLE to  TITLEtitle contentTITLE…
    "\\2TITLE\\0",
  ),
  $wikiTemplate
);
print_r($wikiTemplate);

The output then has the title, tagged by TITLE, above each template, but only where the template has a title:

TITLE(1. template) titleTITLE
START{{Templ1
 | title = (1. template) title
}}END

TITLE
START{{Templ2
 | any parameter = something {{template}}
}}END

TITLE(3. template) titleTITLE
START{{Templ1
 | title = (3. template) title
}}END

Are there any problems with my understanding of the regexp, or any suggestions for improvement? Thanks, Andreas.

score 0 · Accepted Answer

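++ is a possessive quantifier. If you append a + to any repetition quantifier (+, *, {...}), it becomes possessive. This means that once the regex engine has left the repetition for the first time, it will not backtrack into it and try fewer repetitions. So possessive quantifiers basically turn the repetition into an atomic group. Sometimes this is just an optimization, and sometimes it actually changes what matches. You can do some good reading on this here.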
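To make the difference concrete, here is a minimal sketch. The patterns and the variable names `$greedy`, `$possessive` and `$atomic` are made up for illustration and are not taken from the question. It shows a case where possessiveness changes the result, together with the equivalent atomic group.

```php
<?php
// Greedy: \d+ first takes "12345", then gives back the final "5"
// so that the literal 5 in the pattern can still match.
$greedy = preg_match('/^\d+5$/', '12345');

// Possessive: \d++ takes "12345" and never gives anything back,
// so the literal 5 has nothing left to match and the pattern fails.
$possessive = preg_match('/^\d++5$/', '12345');

// The same behaviour written as an atomic group (?>...).
$atomic = preg_match('/^(?>\d+)5$/', '12345');

var_dump($greedy, $possessive, $atomic); // int(1), int(0), int(0)
```

In the question's pattern the situation looks different: the characters consumed by the inner group can never be the start of `{{` or `}}`, so giving them back should not let a later part of the pattern match. There, ++ most likely acts only as an optimization that avoids needless backtracking on unbalanced input, which would explain why + and ++ gave the same results in testing.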

As for your second question: yes, (?R) simply tries to match the complete pattern again. There is a good article on this in PHP's documentation on PCRE.
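As a rough illustration of that recursion, the sketch below uses the first regexp from the question, compacted onto one line without the x-mode comments, on a shortened version of the sample input. The names `$text`, `$pattern` and `$titles` and the follow-up title extraction are assumptions for this example, not part of the original code.

```php
<?php
// Sample input, reduced from the question's templates.
$text = "{{Templ1\n| title = first\n}}\n\n"
      . "{{Templ2\n| any parameter = something {{template}}\n}}";

// {{ ... }} containing either characters that do not start {{ or }},
// or, via (?R), another complete {{ ... }} matched by the whole pattern.
$pattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';

preg_match_all($pattern, $text, $matches);

// Only the two top-level templates are reported; {{template}} is
// consumed by the recursive call while Templ2 is being matched.
print_r($matches[0]);

// Hypothetical follow-up: collect the single-line title of each template.
$titles = [];
foreach ($matches[0] as $template) {
    if (preg_match('/\|\s*title\s*=\s*([^\r\n]+)/', $template, $m)) {
        $titles[] = $m[1];
    }
}
print_r($titles);
```

Note that the title lookup in this sketch searches the whole matched template, so it would also pick up a title parameter that belongs to a nested template.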

For your other questions, Code Review might be a better place to ask.
