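I know that parsing nested strings or HTML is best done by a real parser, but in my case I have simple templates and wanted to extract the content of the wiki parameter "title" from each template. It took me a while to get there, but thanks to Lars Olav Torvik's regex tool ( http://regex.larsolavtorvik.com/ ) and this user forum I got it working. Perhaps someone finds it useful. (We all want to contribute, don't we? ;-) The commented code below does the job. I had to use lookaround assertions so that two templates do not get merged into one match when one of them has no title.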
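I am still unsure about two points, which are marked in the regex comments (look for (?# Question: …)). First, have I understood (?R) correctly? Does it recurse into what is defined at the outermost level, i.e. everything from the second regex line with \{\{ to the last regex line with \}\}? Second, what is the difference between ++ and + in the alternative before (?R)? Both appear to work in my tests.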
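On these two points, here is a minimal sketch; the test strings are made up for illustration and are not from the templates below. In PCRE, (?R) recurses into the whole pattern, so it does re-enter at the opening \{\{ and has to find its own closing \}\}. The ++ quantifier is the possessive form of +: it never gives back characters it has matched. As far as I can tell, in the pattern below both forms match the same text, because the inner group can never stop in the middle of a {{ or }}; the possessive form only saves pointless backtracking when a match fails.

```php
<?php
// Possessive vs. greedy: a++ keeps every "a" it has matched and never gives
// one back, so the trailing "a" cannot match. a+ backtracks by one character
// and succeeds.
$possessive = preg_match('/^a++a$/', 'aaa'); // 0 (no match)
$greedy     = preg_match('/^a+a$/', 'aaa');  // 1 (match)

// (?R) recurses into the whole pattern: each nested "{{" starts a fresh
// attempt of the complete expression, which must find its own closing "}}".
$pattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';
preg_match($pattern, 'x {{a {{b}} c}} y', $m);
$balanced = $m[0]; // "{{a {{b}} c}}"

var_dump($possessive, $greedy, $balanced);
```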
The original wiki templates on the page (simplest case):
$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}
{{Templ2
| any parameter = something {{template}}
}}
{{Templ1
| title = (3. template) title
}}
";
The replacement:
$wikiTemplate = preg_replace(
  array(
    // tag all templates with START … END and add a TITLE placeholder before
    // and take care of balanced {{ … }} recursiveness
    "@(?s) (?# switch to dotall match, i.e. also linebreaks )
      \{\{ (?# find two {{ )
      (?: (?# group 1 as a non-backreferenced match )
        (?: (?# group 2 as a non-backreferenced match )
          (?! (?# in group 1 anything but not {{ or }} )
            \{\{ | (?# or )
            \}\}
          )
          .
        )++ (?# Question: what is the difference between ++ and + here? )
        | (?# or )
        (?R) (?# is it recursive of what is defined in the outermost,
                 i.e. 2nd regexp line with \{\{ and last line with \}\}
                 Question: is that here understood correctly? )
      )
      * (?# zero or many times of the inner regexp definitions )
      \}\} (?# find two }} )
    @x", // x-extended → ignore white space in the pattern
    // replace TITLE by single line content of title parameter
    "@
      (?<=TITLE) (?# TITLE must precede the following linebreak but is not
                     backreferenced within \\0, i.e. the whole returned match)
      ([\n\r]+) (?# linebreak in 1 may also be described as . because of s-modifier dotall)
      (?: (?# start non-backreferenced match )
        . (?# any character but not followed by START)
        (?!START)
      )+ (?# multiple times)
      (?: (?# start non-backreferenced match )
        \|\s*title\s*=\s* (?# find the parameter '| title = ')
      )
      ([^\r\n]+) (?# get title now to \\2 but exclude the line break.
                     Note it is buggy when there is no line break )
      (?: (?# start non-backreferenced match )
        . (?# any character but not followed by END)
        (?!END)
      )+ (?# multiple times)
      . (?# any single character, i.e. the last one, because everything
            before captures only characters not followed by END)
      (?:END) (?# a non-backreferenced END)
    @msx", // m-multiline, s-dotall match also linebreaks,
           // x-extended → ignore white space in the pattern
  ),
  array(
    "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
    # replace the TITLE by TITLEtitle contentTITLE…
    "\\2TITLE\\0",
  ),
  $wikiTemplate
);
print_r($wikiTemplate);
The output then has, above each template, a line with the title enclosed in TITLE markers, but only if the template has a title:
TITLE(1. template) titleTITLE
START{{Templ1
| title = (1. template) title
}}END
TITLE
START{{Templ2
| any parameter = something {{template}}
}}END
TITLE(3. template) titleTITLE
START{{Templ1
| title = (3. template) title
}}END
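A possible simplification, offered only as a sketch: the same TITLE/START/END layout can be produced in one pass with preg_replace_callback, which hands each balanced template to a function, so the second lookaround pattern is not needed. The function name tagTemplatesWithTitle is hypothetical, and two details are my assumptions rather than part of the code above: nested templates are stripped before looking for the title, and the title is trimmed.

```php
<?php
// Hypothetical helper, not part of the original code.
function tagTemplatesWithTitle(string $text): string
{
    // Same balanced {{ … }} matcher as above, without the inline comments.
    $template = '~\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}~s';

    $result = preg_replace_callback(
        $template,
        function (array $m) use ($template): string {
            // Assumption: remove nested templates first, so that a
            // "| title =" inside an inner template is not taken for the
            // outer template's parameter.
            $inner = substr($m[0], 2, -2);
            $flat  = preg_replace($template, '', $inner) ?? $inner;

            // Title = rest of the line after "| title =" (up to a line
            // break or the next "|"); an empty title counts as no title.
            $title = '';
            if (preg_match('~\|\s*title\s*=[ \t]*([^\r\n|]+)~', $flat, $t)) {
                $title = trim($t[1]);
            }

            $head = ($title === '') ? 'TITLE' : 'TITLE' . $title . 'TITLE';
            return $head . "\nSTART" . $m[0] . 'END';
        },
        $text
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $text;
}

$sample = "{{Templ1\n| title = (1. template) title\n}}\n"
        . "{{Templ2\n| any parameter = something {{template}}\n}}";
$tagged = tagTemplatesWithTitle($sample);
print_r($tagged);
```

Because each template is handled on its own inside the callback, a template without a title cannot borrow the title of the following one, which is the case the lookarounds guard against in the two-step version.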
Any comments on my understanding of the regex, or suggestions for improvement? Thanks, Andreas.