php - Strange problem with a regular expression

Question

I have a regular expression:

~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})(?P<contents>[^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*)(?P<closing>{/block:(?P=name)})~is

which I am trying to match against the following:

<ul>{block:menu}
    <li><a href="{var:link}">{var:title}</a>
{/block:menu}</ul>

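This works fine, but when a third part is introduced into the block tag, for example {block:menu:thirdbit}, it fails to match. However, if you chop off the end of the regex and trim it down to the following, it does match, which suggests that the pattern is fine and something else is wrong: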

(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})

Any ideas what is going wrong?

score 1 · Accepted Answer

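Just an idea: convert all {block:menu} and similar elements into XML elements in their own namespace. You can then use XPath to get the job done. You should even be able to do this on the fly.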
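A minimal sketch of that idea, under several assumptions that are not in the question: the template is already well-formed XML (the sample in the question has an unclosed <li>, so it is closed here), the DOM extension is available, only the plain {block:name} form is converted, and the namespace URI, the "t" prefix, and the <root> wrapper are made up for illustration.

```php
<?php
// Hypothetical namespace URI, chosen only for this sketch.
const TPL_NS = 'urn:example:template';

// Assumed input: a well-formed fragment (note the closed <li>).
$template = '<ul>{block:menu}<li><a href="{var:link}">{var:title}</a></li>{/block:menu}</ul>';

// Rewrite opening and closing block tags as namespaced XML elements.
$xml = preg_replace('~\{block:([a-z0-9_-]+)\}~i', '<t:block name="$1">', $template);
$xml = preg_replace('~\{/block:[a-z0-9_-]+\}~i', '</t:block>', $xml);

// Wrap everything in a root element that declares the namespace, then parse.
$doc = new DOMDocument();
$doc->loadXML('<root xmlns:t="' . TPL_NS . '">' . $xml . '</root>');

// Query the converted blocks with XPath.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('t', TPL_NS);
$blocks = $xpath->query('//t:block[@name="menu"]');

foreach ($blocks as $block) {
    echo $doc->saveXML($block), "\n";
}
```

The {var:...} tags are left alone here because one of them sits inside an attribute value, where it cannot become an element.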

score 1 · Accepted Answer

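First, as Tim correctly pointed out, parsing HTML with a regex is unwise.

Second, as written, the regex in the question is unreadable. I took the liberty of reformatting it. Here is a working script containing a commented, readable version of the exact same regex: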

<?php // test.php Rev:20120830_1300
$re = '%
    # Match a non-nested "{block:name:func(params)}...{/block:name}" structure.
    (?P<opening>                      # $1: == $opening: BLOCK start tag.
      {                               # BLOCK tag opening literal "{"
      (?P<inverse>[!])?               # $2: == $inverse: Optional "!" negation.
      block:                          # Opening BLOCK tag ident.
      (?P<name>[a-z0-9\s_-]+)         # $3: == $name: BLOCK element name.
      (                               # $4: Optional BLOCK function.
        :                             # Function name preceded with ":".
        (?P<function>[a-z0-9\s_-]+)   # $5: == $function: Function name.
        (                             # $6: Optional function parameters.
          [\s]?                       # Allow one whitespace before (params).
          \(                          # Literal "(" params opening char.
          (?P<params>[^)]*)           # $7: == $params: function parameters.
          \)                          # Literal ")" params closing char.
        )?                            # End $6: Optional function parameters.
      )?                              # End $4: Optional BLOCK function.
      }                               # BLOCK tag closing literal "}"
    )                                 # End $1: == $opening: BLOCK start tag.
    (?P<contents>                     # $contents: BLOCK element contents.
      [^{]*                           # {normal} Zero or more non-"{"
      (?:                             # Begin {(special normal*)*} construct.
        \{                            # {special} Allow a "{" but only if it is
        (?!/?block:[a-z0-9\s_-]+\})   # not a BLOCK tag opening literal "{".
        [^{]*                         # More {normal}
      )*                              # Finish "Unrolling-the-Loop" (See: MRE3).
    )                                 # End $contents: BLOCK element contents.
    (?P<closing>                      # $closing: BLOCK element end tag.
      {                               # BLOCK tag opening literal "{"
      /block:                         # Closing BLOCK tag ident.
      (?P=name)                       # Close name must match open name.
      }                               # BLOCK tag closing literal "}"
    )                                 # End $closing: BLOCK element end tag.
    %six';

$text = file_get_contents('testdata.html');
if (preg_match($re, $text, $matches)) print_r($matches);
else echo("no match!");
?>

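Note that the extra indentation and comments let a person actually understand what the regex is trying to do. My testing shows that there is nothing wrong with the regex and that it works as advertised. It even implements Jeffrey Friedl's advanced "Unrolling-the-Loop" efficiency technique, so whoever wrote it has some real regex skills.

For example, given the following data taken from the original question: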

<ul>{block:menu}
    <li><a href="{var:link}">{var:title}</a>
{/block:menu}</ul>

Here is the (correct) output from the script:

'''
Array
(
    [0] => {block:menu}
    <li><a href="{var:link}">{var:title}</a>
{/block:menu}
    [opening] => {block:menu}
    [1] => {block:menu}
    [inverse] =>
    [2] =>
    [name] => menu
    [3] => menu
    [4] =>
    [function] =>
    [5] =>
    [6] =>
    [params] =>
    [7] =>
    [contents] =>
    <li><a href="{var:link}">{var:title}</a>

    [8] =>
    <li><a href="{var:link}">{var:title}</a>

    [closing] => {/block:menu}
    [9] => {/block:menu}
)
'''

It also works when the optional function and params are included in the test data.
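A quick way to check the function/params case yourself is a sketch like the one below. It uses the regex from the question unchanged; the input string, the function name thirdbit, and the parameter list "1, 2" are made up for illustration.

```php
<?php
// The pattern exactly as posted in the question.
$re = '~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})(?P<contents>[^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*)(?P<closing>{/block:(?P=name)})~is';

// Hypothetical test data: the question's sample with a function and params added.
$text = "<ul>{block:menu:thirdbit(1, 2)}\n"
      . "    <li><a href=\"{var:link}\">{var:title}</a>\n"
      . "{/block:menu}</ul>";

if (preg_match($re, $text, $m)) {
    // Report only the named parts of the opening tag.
    printf("name=%s function=%s params=%s\n", $m['name'], $m['function'], $m['params']);
} else {
    echo "no match!\n";
}
```

If this prints the three named parts rather than "no match!", the pattern itself handles the third part of the tag, and the problem lies in the surrounding code or the input.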

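That said, I do have a few issues with the question/regex:

The regex mixes named and numbered capture groups.
"{" and "}" are metacharacters and should be escaped (although PCRE is able to correctly determine that they should be interpreted literally in this context).
It is unclear how the user plans to use the optionally captured groups.
It is unclear what problem the OP is having with this regex.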
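One possible way to address the first three points, as a sketch rather than a drop-in replacement: escape the literal braces, make the unnamed groups non-capturing, and ask PHP to report unmatched optional groups as null. This assumes PHP 7.2 or later for PREG_UNMATCHED_AS_NULL; the sample text is made up for illustration.

```php
<?php
$re = '%
    (?P<opening>
      \{ (?P<inverse>!)? block:
      (?P<name>[a-z0-9\s_-]+)
      (?:                                  # Non-capturing: optional function part.
        : (?P<function>[a-z0-9\s_-]+)
        (?: \s? \( (?P<params>[^)]*) \) )? # Non-capturing: optional params.
      )?
      \}
    )
    (?P<contents> [^{]* (?: \{ (?!/?block:[a-z0-9\s_-]+\}) [^{]* )* )
    (?P<closing> \{ /block: (?P=name) \} )
    %six';

// Hypothetical input with no function part, so the optional groups do not participate.
$text = '{block:menu}<li>{var:title}</li>{/block:menu}';

$named = [];
if (preg_match($re, $text, $m, PREG_UNMATCHED_AS_NULL)) {
    // Keep only the named keys; PHP also stores each group under a numeric index.
    $named = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}
print_r($named);
```

With null for the groups that did not take part in the match, the calling code can tell "no function given" apart from an empty string without relying on group numbers.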
