php - PHP regular expression crashes Apache

Question

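I have a regular expression that does the matching for a templating system, and unfortunately it seems to crash Apache (which is running on Windows) on some fairly trivial lookups. I've researched the issue and found a few suggestions about raising the stack size and so on, but none of them seem to work, and I don't really like dealing with this kind of problem by raising limits anyway, since that generally just pushes the bug into the future.

Anyway, any ideas on how to alter the regex to make it less likely to fail?

The idea is to catch the innermost block (in this case {block:test}This should be caught first!{/block:test}), which I'll then str_replace out the starting/ending tags from, and re-run the whole thing through the regex until there are no blocks left.

The regex: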

~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)})(?P<contents>(?:(?!{/?block:[0-9a-z-_]+}).)*)(?P<closing>{/block:\3})~ism

Sample template:

<div class="f_sponsors s_banners">
    <div class="s_previous">&laquo;</div>
    <div class="s_sponsors">
        <ul>
            {block:sponsors}
            <li>
                <a href="{var:url}" target="_blank">
                    <img src="image/160x126/{var:image}" alt="{var:name}" title="{var:name}" />
                </a>
            {block:test}This should be caught first!{/block:test}
            </li>
            {/block:sponsors}
        </ul>
    </div>
    <div class="s_next">&raquo;</div>
</div>

I suppose this is a long shot. :(

score 4 · Accepted Answer

You can use an atomic group, (?>...), or possessive quantifiers, ?+ *+ ++, to suppress or limit backtracking, and speed up matching with the loop-unrolling technique. My solution:

\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}

I have tested it at http://regexr.com?31p03.
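To make the difference between greedy, possessive and atomic matching concrete, here is a minimal sketch. The pattern and subject are made up for illustration and are not part of the template regex.

```php
<?php
// A greedy quantifier gives characters back when the rest of the pattern fails:
// a+ first takes "aaa", then backs off one "a" so that "ab" can match.
$greedy = preg_match('~a+ab~', 'aaab');

// Possessive and atomic forms keep everything they consumed, so once a++ has
// taken "aaa" there is no "a" left for the literal "ab" and the match fails.
$possessive = preg_match('~a++ab~', 'aaab');
$atomic = preg_match('~(?>a+)ab~', 'aaab');

var_dump($greedy, $possessive, $atomic);
```

Failing fast like this is what keeps the engine from retrying every shorter alternative, which is where the deep backtracking comes from.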

To match {block:sponsors}...{/block:sponsors}:
\{block:(sponsors)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb3

To match {block:test}...{/block:test}:
\{block:(test)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}
http://regexr.com?31rb6
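A rough way to try these patterns from PHP is sketched below. The shortened template and the ~ delimiter are assumptions made for the example. Note that the generic pattern matches the outer block first, while the pattern with the name fixed to test picks out the inner one.

```php
<?php
// Shortened stand-in for the sample template (an assumption for this sketch)
$template = "{block:sponsors}\n<li>{block:test}This should be caught first!{/block:test}</li>\n{/block:sponsors}";

// Generic pattern from the answer, wrapped in ~ delimiters
$generic = '~\{block:(\w++)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}~';
preg_match($generic, $template, $outer);

// Same pattern with the block name fixed to "test"
$named = '~\{block:(test)\}([^<{]++(?:(?!\{\/?block:\1\b)[<{][^<{]*+)*+)\{/block:\1\}~';
preg_match($named, $template, $inner);

echo $outer[1], "\n"; // name captured by the generic pattern
echo $inner[2], "\n"; // contents of the test block
```

One thing to keep in mind: [^<{]++ needs at least one character after the opening tag, so a block whose contents start directly with < or { would not match as written.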

Another solution:
In the PCRE source code, you can enable NO_RECURSE by editing this commented-out line in config.h:
/* #undef NO_RECURSE */

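The following text is copied from config.h:
PCRE uses recursive function calls to handle backtracking while matching. This can sometimes be a problem on systems that have stacks of limited size. Define NO_RECURSE to get a version that doesn't use recursion in the match() function; instead it creates its own stack by using pcre_recurse_malloc() to obtain memory from the heap.

Or you can change pcre.backtrack_limit and pcre.recursion_limit in php.ini (http://www.php.net/manual/en/pcre.configuration.php).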
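Both settings can also be changed at runtime with ini_set(). The sketch below is only an illustration: the limit values are arbitrary, and safe_match() is a hypothetical helper, not something from the answer. Depending on the PHP and PCRE build, a lower pcre.recursion_limit may make a match fail with an error code instead of exhausting the process stack, so it is worth checking for false rather than treating it as "no match".

```php
<?php
// Illustrative values only; the same settings can be placed in php.ini.
ini_set('pcre.backtrack_limit', '100000');
ini_set('pcre.recursion_limit', '10000');

// Hypothetical helper: run a match and surface PCRE errors (such as an
// exhausted backtrack or recursion limit) instead of silently ignoring them.
function safe_match($pattern, $subject, &$matches = null)
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result === 1;
}

$found = safe_match('~\{block:(\w+)\}~', '{block:test}', $m);
```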

score 4 · Accepted Answer

Try this:

'~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)(?P<closing>\{/block:(?P=name)\})~i'

Or, in readable form:

'~(?P<opening>
  \{
  (?P<inverse>[!])?
  block:
  (?P<name>[a-z0-9\s_-]+)
  \}
)
(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)
(?P<closing>
  \{
  /block:(?P=name)
  \}
)~ix'

The most important part is in the (?P<contents>..) group:

[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*

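Starting out, the only character we're interested in is the opening brace, so we can slurp up any other characters with [^{]*. Only after we see a { do we check whether it's the beginning of a {/block} tag. If it's not, we go ahead and consume it and start scanning for the next one, and repeat as necessary.

Using RegexBuddy, I tested each regex by placing the cursor at the beginning of the {block:sponsors} tag and debugging. Then I removed the closing brace from the closing {/block:sponsors} tag to force a failed match and debugged it again. Your regex took 940 steps to succeed and 2265 steps to fail. Mine took 57 steps to succeed and 83 steps to fail.

On a side note, I removed the s modifier because I'm not using the dot (.), and the m modifier because it was never needed. I also used the named backreference (?P=name) instead of \3, per @DaveRandom's excellent suggestion. And I escaped all the braces ({ and }) because I find it easier to read that way.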
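As a quick check of how the named groups come back in PHP, here is a small sketch that runs the one-line form of the regex. The one-line template is an assumption made for the example. At this stage the regex matches the outer block, nested tags included.

```php
<?php
// The one-line pattern from above, split across string pieces for readability
$pattern = '~(?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})'
         . '(?P<contents>[^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*)'
         . '(?P<closing>\{/block:(?P=name)\})~i';

// Shortened stand-in for the sample template (an assumption for this sketch)
$template = '{block:sponsors}<li>{var:name} {block:test}inner{/block:test}</li>{/block:sponsors}';

if (preg_match($pattern, $template, $m)) {
    echo $m['name'], "\n";     // name of the block that matched
    echo $m['contents'], "\n"; // everything between its tags, nested tags included
}
```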

EDIT: If you want to match the innermost named block, change the middle portion of the regex from this:

(?P<contents>
  [^{]*(?:\{(?!/block:(?P=name)\})[^{]*)*
)

...to this (as suggested by @Kobi in his comment):

(?P<contents>
  [^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*
)

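Originally, the (?P<opening>...) group would grab the first opening tag it saw, then the (?P<contents>..) group would consume anything, including other tags, as long as they weren't the closing tag matching the one found by the (?P<opening>...) group. (Then the (?P<closing>...) group would go ahead and consume that.)

Now, the (?P<contents>...) group refuses to match any tag, opening or closing (note the /? at the beginning), no matter what the name is. So the regex initially starts to match the {block:sponsors} tag, but when it encounters the {block:test} tag, it abandons that match and goes back to searching for an opening tag. It starts again at the {block:test} tag, this time successfully completing the match when it finds the {/block:test} closing tag.

It sounds inefficient when described like that, but it really isn't. The trick I described earlier, slurping up the non-braces, swamps the effect of these false starts. Where you were doing a negative lookahead at almost every position, now you do one only when you encounter a {. You could even use possessive quantifiers, as @godspeedlee suggested: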

(?P<contents>
  [^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+
)

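...because you know it will never consume anything that it will have to give back later. That would speed things up a little, but it's not really necessary.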
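Putting the innermost-block variant to work, the loop from the question (match, strip the tags, repeat until no blocks remain) could be sketched as follows. The one-line template and the "just keep the contents" replacement are assumptions; real processing would go in the callback.

```php
<?php
// Innermost-block variant with possessive quantifiers, in extended (x) form
$pattern = '~
  (?P<opening>\{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)\})
  (?P<contents>[^{]*+(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*+)*+)
  (?P<closing>\{/block:(?P=name)\})
~ix';

// Shortened stand-in for the sample template (an assumption for this sketch)
$template = '<ul>{block:sponsors}<li>{var:name} {block:test}inner{/block:test}</li>{/block:sponsors}</ul>';
$order = array(); // records the order in which blocks are resolved

do {
    // Replace one block per pass so the next pass sees the updated template
    $template = preg_replace_callback($pattern, function ($m) use (&$order) {
        $order[] = $m['name'];
        return $m['contents']; // placeholder processing: drop the tags only
    }, $template, 1, $count);
} while ($count > 0);

echo implode(', ', $order), "\n";
echo $template, "\n";
```

The inner test block is resolved before sponsors, because the first pass cannot complete a match that starts at {block:sponsors}.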

score 4 · Accepted Answer

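Does the solution have to be a single regex? A more efficient approach might be simply to look for the first occurrence of {/block: (which could be a simple string search or a regex) and then search backward from that point to find its matching opening tag, replace the span appropriately and repeat until there are no more blocks. If you look for the first closing tag starting from the top of the template every time, this will give you the most deeply nested block.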
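A minimal sketch of this first-closing-tag approach, using only string functions, might look like the following. The function name find_innermost_block(), the returned array keys and the "keep the contents, drop the tags" replacement are all hypothetical choices made for the example.

```php
<?php
// Hypothetical helper: find the first closing tag, then the nearest opening
// tag before it. Returns null when no well-formed block is left.
function find_innermost_block($template)
{
    $close = strpos($template, '{/block:');
    if ($close === false) {
        return null; // no closing tags left
    }
    $name_start = $close + strlen('{/block:');
    $name = substr($template, $name_start, strcspn($template, '}', $name_start));
    $end = $name_start + strlen($name) + 1; // offset just past the closing '}'

    // Nearest opening tag (plain or inverted) before the closing tag
    $before = substr($template, 0, $close);
    $found = array_filter(
        array(strrpos($before, '{block:'), strrpos($before, '{!block:')),
        'is_int'
    );
    if (!$found) {
        return null; // closing tag without any opening tag
    }
    $start = max($found);
    $inverse = ($before[$start + 1] === '!');
    $tag = ($inverse ? '{!block:' : '{block:') . $name . '}';
    if (substr($before, $start, strlen($tag)) !== $tag) {
        return null; // nearest opening tag has a different name
    }
    $contents = substr($before, $start + strlen($tag));
    return compact('name', 'inverse', 'start', 'end', 'contents');
}

$template = '<ul>{block:sponsors}<li>{block:test}inner{/block:test}</li>{/block:sponsors}</ul>';
$order = array();
while (($block = find_innermost_block($template)) !== null) {
    $order[] = $block['name'];
    // Placeholder processing: keep the contents, drop the tags
    $template = substr_replace($template, $block['contents'],
        $block['start'], $block['end'] - $block['start']);
}
echo $template, "\n";
```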

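The mirror-image algorithm would work just as well: look for the last opening tag and then search forward from there for the corresponding closing tag: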

<?php

$template = //...

while(true) {
  $last_open_tag = strrpos($template, '{block:');
  $last_inverted_tag = strrpos($template, '{!block:');
  // $block_start is the index of the '{' of the last opening block tag in the
  // template, or false if there are no more block tags left
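  // (max() treats false and 0 as equal, so drop the false values first)
  $found = array_filter(array($last_open_tag, $last_inverted_tag), 'is_int');
  $block_start = $found ? max($found) : false;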
  if($block_start === false) {
    // all done
    break;
  } else {
    // extract the block name (the foo in {block:foo}) - from the character
    // after the next : to the character before the next }, inclusive
    $block_name_start = strpos($template, ':', $block_start) + 1;
    $block_name = substr($template, $block_name_start,
        strcspn($template, '}', $block_name_start));

    // we now have the start tag and the block name, next find the end tag.
    // $block_end is the index of the '{' of the next closing block tag after
    // $block_start.  If this doesn't match the opening tag something is wrong.
    $block_end = strpos($template, '{/block:', $block_start);
    if($block_end === false || strpos($template, $block_name.'}', $block_end + 8) !== $block_end + 8) {
      // non-matching tag
      print("Non-matching tag found\n");
      break;
    } else {
      // now we have found the innermost block
      // - its start tag begins at $block_start
      // - its content begins at
      //   (strpos($template, '}', $block_start) + 1)
      // - its content ends at $block_end
      // - its end tag ends at ($block_end + strlen($block_name) + 9)
      //   [9 being the length of '{/block:' plus '}']
      // - the start tag was inverted iff $block_start === $last_inverted_tag
      $template = // do whatever you need to do to replace the template
    }
  }
}

echo $template;
