php - Highlighting match results in the subject string of preg_match_all()

Question

I'm trying to highlight a subject string using the $matches array returned by preg_match_all(). Let me start with an example:

preg_match_all("/(.)/", "abc", $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

This returns:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => a
                    [1] => 0
                )

            [1] => Array
                (
                    [0] => a
                    [1] => 0
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => b
                    [1] => 1
                )

            [1] => Array
                (
                    [0] => b
                    [1] => 1
                )

        )

    [2] => Array
        (
            [0] => Array
                (
                    [0] => c
                    [1] => 2
                )

            [1] => Array
                (
                    [0] => c
                    [1] => 2
                )

        )

)

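What I want to do in this case is highlight the overall consumed data (the whole match) and each backreference.

The output should look like this: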

<span class="match0">
    <span class="match1">a</span>
</span>
<span class="match0">
    <span class="match1">b</span>
</span>
<span class="match0">
    <span class="match1">c</span>
</span>

Another example:

preg_match_all("/(abc)/", "abc", $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

should return:

<span class="match0"><span class="match1">abc</span></span>

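I hope this is clear enough.

I want to highlight the overall consumed data and highlight each backreference.

Thanks in advance. If anything is unclear, please ask.

Note: it must not break the HTML. Both the regex AND the input string are unknown to the code and completely dynamic. So the subject string may be HTML, and the matched data may or may not contain HTML-like text.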

score 3 · Accepted Answer

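This seems to behave correctly for all the examples I've thrown at it so far. Note that I've split the abstract highlighting part from the HTML-handling part, so it can be reused in other situations: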

<?php

/**
 * Runs a regex against a string, and return a version of that string with matches highlighted
 * the outermost match is marked with [0]...[/0], the first sub-group with [1]...[/1] etc
 *
 * @param string $regex Regular expression ready to be passed to preg_match_all
 * @param string $input
 * @return string
 */
function highlight_regex_matches($regex, $input)
{
    $matches = array();
    preg_match_all($regex, $input, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

    // Arrange matches into groups based on their starting and ending offsets
    $matches_by_position = array();
    foreach ( $matches as $sub_matches )
    {
            foreach ( $sub_matches as $match_group => $match_data )
            {
                    // A group that did not take part in the match is reported with offset -1; skip it
                    if ( $match_data[1] < 0 )
                    {
                            continue;
                    }

                    $start_position = $match_data[1];
                    $end_position = $start_position + strlen($match_data[0]);

                    $matches_by_position[$start_position]['START'][] = $match_group;

                    $matches_by_position[$end_position]['END'][] = $match_group;
            }
    }

    // Now proceed through that array, annotating the original string
    // Note that we have to pass through BACKWARDS, or we break the offset information
    $output = $input;
    krsort($matches_by_position);
    foreach ( $matches_by_position as $position => $matches )
    {
            $insertion = '';

            // First, assemble any ENDING groups, nested highest-group first
            // (!empty() avoids an undefined-key warning when nothing ends here)
            if ( !empty($matches['END']) )
            {
                    krsort($matches['END']);
                    foreach ( $matches['END'] as $ending_group )
                    {
                            $insertion .= "[/$ending_group]";
                    }
            }

            // Then, any STARTING groups, nested lowest-group first
            if ( !empty($matches['START']) )
            {
                    ksort($matches['START']);
                    foreach ( $matches['START'] as $starting_group )
                    {
                            $insertion .= "[$starting_group]";
                    }
            }

            // Insert into output
            $output = substr_replace($output, $insertion, $position, 0);
    }

    return $output;
}

/**
 * Given a regex and a string containing unescaped HTML, return a blob of HTML
 * with the original string escaped, and matches highlighted using <span> tags
 *
 * @param string $regex Regular expression ready to be passed to preg_match_all
 * @param string $raw_html Unescaped text or HTML to search and highlight
 * @return string HTML ready to display :)
 */
function highlight_regex_as_html($regex, $raw_html)
{
    // Add the (deliberately non-HTML) highlight tokens
    $highlighted = highlight_regex_matches($regex, $raw_html);

    // Escape the HTML from the input
    $highlighted = htmlspecialchars($highlighted);

    // Substitute the match tokens with desired HTML
    $highlighted = preg_replace('#\[([0-9]+)\]#', '<span class="match\\1">', $highlighted);
    $highlighted = preg_replace('#\[/([0-9]+)\]#', '</span>', $highlighted);

    return $highlighted;
}

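Note: as hakra pointed out to me in chat, if a sub-group in the regex can occur more than once within one overall match (e.g. '/a(b|c)+/'), preg_match_all will only tell you about the last of those matches. So highlight_regex_matches('/a(b|c)+/', 'abc') returns '[0]ab[1]c[/1][/0]', not the '[0]a[1]b[/1][1]c[/1][/0]' you might expect or want. All matching groups outside that one still work correctly, so highlight_regex_matches('/a((b|c)+)/', 'abc') gives '[0]a[1]b[2]c[/2][/1][/0]', which is still a pretty good indication of how the regex matched.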
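To see the repeated-group behaviour on its own, here is a minimal sketch that calls preg_match_all() directly with the same two patterns. The variable names $matches and $wrapped are only for illustration; nothing here depends on the functions above.

```php
<?php
// Repeated group: only the final iteration of (b|c) is reported.
preg_match_all('/a(b|c)+/', 'abc', $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
print_r($matches[0][1]); // group 1 of the first match: 'c' at offset 2

// Wrapping the repetition in an outer group exposes the whole repeated span.
preg_match_all('/a((b|c)+)/', 'abc', $wrapped, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
print_r($wrapped[0][1]); // outer group: 'bc' at offset 1
print_r($wrapped[0][2]); // inner group: still only 'c' at offset 2
```

The 'b' iteration never shows up as its own entry, which is why the highlighter cannot mark it separately.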

score 0 · Accepted Answer

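Reading your comments under the first answer, I'm pretty sure you didn't really ask the question you meant to ask. However, going by what you specifically asked for, this is it: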

$pattern = "/(.)/";
$subject = "abc";

$callback = function($matches) {
    if ($matches[0] !== $matches[1]) {
        throw new InvalidArgumentException(
            sprintf('you do not match the requirements, go away: %s'
                    , print_r($matches, true))
        );
    }
    return sprintf('<span class="match0"><span class="match1">%s</span></span>'
                   , htmlspecialchars($matches[1]));
};
$result = preg_replace_callback($pattern, $callback, $subject);

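Before you start complaining, look at where your description of the problem falls short. I have a feeling that you actually want to parse the match result, but that you also want to handle sub-matches. That doesn't work unless you also parse the regex to find out which groups are used. So far that is not the case, neither in your question nor in this answer.

So please use this example only for a single subgroup, which as a requirement must also be the whole pattern. Apart from that, it is completely dynamic.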

score 0 · Accepted Answer

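I'm not too familiar with posting on stackoverflow, so I hope I don't mess this up. I do this in almost the same way as @IMSoP, but slightly differently:

I store the tags like this: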

$tags[ $matched_pos ]['open'][$backref_nr] = "open tag";
$tags[ $matched_pos + $len ]['close'][$backref_nr] = "close tag";

As you can see, this is almost the same as @IMSoP's.

Then I build the string like this, instead of inserting and sorting like @IMSoP does:

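$finalStr = "";
$length = strlen($text);
for ($i = 0; $i <= $length; $i++) {
    if (isset($tags[$i])) {
        foreach ($tags[$i] as $tag) {
            foreach ($tag as $span) {
                $finalStr .= $span;
            }
        }
    }
    // The extra iteration at $i == $length only emits tags that close at the very end
    if ($i < $length) {
        $finalStr .= $text[$i];
    }
}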

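where $text is the text used in preg_match_all().

I think my solution is a bit faster than @IMSoP's, since he has to sort every time and so on. But I'm not sure.

My main concern right now is performance. But maybe it just can't be made any faster than this?

I've been trying to get a recursive preg_replace_callback() approach going, but so far I haven't been able to make it work. preg_replace_callback() seems to be very, very fast. Much faster than what I'm currently doing, anyway.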
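On the performance question, one option is to walk only the positions that actually carry tags instead of every character, copying the text in between with substr(). The following is just a sketch of that idea, not a measured improvement. The function name highlight_by_segments is made up for illustration, the tag map is simplified to plain lists, and it assumes match boundaries never fall inside a multi-byte character (true for ASCII input, or for UTF-8 input with the /u modifier).

```php
<?php
// Hypothetical helper, not from the thread: one pass over tag positions
// instead of one pass over every character of the text.
function highlight_by_segments(string $regex, string $text): string
{
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

    // $tags[position] = ['open' => [...], 'close' => [...]]
    $tags = [];
    foreach ($matches as $set) {
        foreach ($set as $backref_nr => [$matched, $matched_pos]) {
            // Skip named-group duplicates and groups that did not take part
            if (!is_int($backref_nr) || $matched_pos < 0) {
                continue;
            }
            $tags[$matched_pos]['open'][] = "<span class=\"match$backref_nr\">";
            $tags[$matched_pos + strlen($matched)]['close'][] = '</span>';
        }
    }
    ksort($tags); // a single sort of the tag positions

    $finalStr = '';
    $cursor = 0;
    foreach ($tags as $pos => $tag) {
        // Copy (and escape) the untouched text between two tag positions
        $finalStr .= htmlspecialchars(substr($text, $cursor, $pos - $cursor));
        // Emit closing tags first, then tags that open at the same position
        $finalStr .= implode('', $tag['close'] ?? []) . implode('', $tag['open'] ?? []);
        $cursor = $pos;
    }

    return $finalStr . htmlspecialchars(substr($text, $cursor));
}

echo highlight_by_segments('/(abc)/', 'abc');
// <span class="match0"><span class="match1">abc</span></span>
```

Emitting all closing tags before the opening ones is a simplification: it can produce wrongly nested output when an empty group opens and closes at a position where another group also ends.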

score -1 · Accepted Answer

Quick mash-up: why use a regex at all?

$content = "abc";
$endcontent = "";

for($i = 0; $i < strlen($content); $i++)
{
    $endcontent .= "<span class=\"match0\"><span class=\"match1\">" . $content[$i] . "</span></span>";
}

echo $endcontent;
