php - Merging regular expressions in PHP

Question

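Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want the two expressions to act as alternatives.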

$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'

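Of course, doing this with string operations isn't practical, because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'd be happy without that last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).

Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(

Alternatively, is there a way to check efficiently whether either of the two expressions matches, and which one matches earlier (and, if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).

Edit:

I should have been more specific, sorry. $a and $b are variables; their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore case) while the second uses x (extended syntax). Therefore, I can't just concatenate the two, because the second expression does not ignore case and the first doesn't use the extended syntax (so any whitespace in it is significant!).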

score 3 · Accepted Answer

Strip the delimiters and flags from each expression. This regex should do it:
```
/^(.)(.*)\1([imsxeADSUXJu]*)$/
```
Join the expressions together. You'll need non-capturing parentheses to inject the flags:
```
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
```
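If there are any back-references, count the capturing parentheses and update the back-references accordingly (e.g. properly concatenated, /(.)x\1/ and /(.)y\1/ become /(.)x\1|(.)y\2/).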
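
The first two steps can be sketched roughly as follows. This is only an illustration under some assumptions: `split_regex` and `merge_alternatives` are hypothetical helper names, only the modifiers `i`, `m`, `s` and `x` are accepted (those can be set inline), paired delimiters such as `{…}` and trailing `x`-mode comments are not handled, and back-references are not renumbered.

```php
<?php
// Hypothetical helper: split "<delim>body<delim>flags" into [body, flags].
function split_regex(string $regex): array
{
    if (preg_match('/^(.)(.*)\1([a-zA-Z]*)$/s', $regex, $m) !== 1) {
        throw new InvalidArgumentException("Not a delimited regex: $regex");
    }
    // Assumption: only modifiers that PCRE accepts inside (?...) are supported.
    if (preg_match('/^[imsx]*$/', $m[3]) !== 1) {
        throw new InvalidArgumentException("Unsupported modifier in: $regex");
    }
    return [$m[2], $m[3]];
}

// Hypothetical helper: join several regexes as alternatives of one pattern.
function merge_alternatives(array $regexes): string
{
    $parts = [];
    foreach ($regexes as $regex) {
        [$body, $flags] = split_regex($regex);
        // Escape bare slashes, skipping over existing escape pairs such as "\/".
        $body = preg_replace_callback(
            '~\\\\.|/~s',
            function (array $m): string {
                return $m[0] === '/' ? '\\/' : $m[0];
            },
            $body
        );
        // Each alternative carries its own flags in a non-capturing group.
        $parts[] = "(?$flags:$body)";
    }
    return '/' . implode('|', $parts) . '/';
}

$c = merge_alternatives(['# /[a-z] #i', '/ Moo /x']);
// $c is now '/(?i: \/[a-z] )|(?x: Moo )/'
```

Because each alternative sits in its own `(?flags:…)` group, the `i` of the first expression does not leak into the second, and the whitespace in the first stays significant.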

score 3 · Accepted Answer

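I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets the modifiers specified in each sub-expression. It also handles non-slash delimiters (I couldn't find a specification of which characters are allowed here, so I used ., you may want to narrow this further).

One weakness is that it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.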

// Pass as many expressions as you'd like
function preg_magic_coalesce() {
    $active_modifiers = array();

    $expression = '/(?:';
    $sub_expressions = array();
    foreach(func_get_args() as $arg) {
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
            $modifiers = preg_split('//', $matches[3]);
            if($modifiers[0] == '') {
                array_shift($modifiers);
            }
            if($modifiers[(count($modifiers) - 1)] == '') {
                array_pop($modifiers);
            }

            $cancel_modifiers = $active_modifiers;
            foreach($cancel_modifiers as $key => $modifier) {
                if(in_array($modifier, $modifiers)) {
                    unset($cancel_modifiers[$key]);
                }
            }
            $active_modifiers = $modifiers;
        } elseif(preg_match('/(.)(.*)\1$/', $arg)) {
            $cancel_modifiers = $active_modifiers;
            $active_modifiers = array();
        }

        // If expression has modifiers, include them in sub-expression
        $sub_modifier = '(?';
        $sub_modifier .= implode('', $active_modifiers);

        // Cancel modifiers from preceding sub-expression
        if(count($cancel_modifiers) > 0) {
            $sub_modifier .= '-' . implode('-', $cancel_modifiers);
        }

        $sub_modifier .= ')';

        $sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);

        // Properly escape slashes
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/';
    return $expression;
}

Edit: I rewrote this (because I'm OCD) and ended up with:

function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
    if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
        $global_modifier = '';
    }

    $expression = '/(?:';
    $sub_expressions = array();
    foreach($expressions as $sub_expression) {
        $active_modifiers = array();
        // Determine modifiers from sub-expression
        if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
            $active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
                $matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
        }

        // If expression has modifiers, include them in sub-expression
        if(count($active_modifiers) > 0) {
            $replacement = '(?';
            $replacement .= implode('', $active_modifiers);
            $replacement .= ':$2)';
        } else {
            $replacement = '$2';
        }

        $sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
            $replacement, $sub_expression);

        // Properly escape slashes if another delimiter was used
        $sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);

        $sub_expressions[] = $sub_expression;
    }

    // Join expressions
    $expression .= implode('|', $sub_expressions);

    $expression .= ')/' . $global_modifier;
    return $expression;
}

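It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression, but I've noticed that both have some weird modifier side effects. For instance, in both cases, if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, it matches just fine).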
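
One plausible explanation for the /u behaviour: PCRE's inline option syntax covers only a subset of PHP's pattern modifiers. `i`, `m`, `s` and `x` can be written as `(?imsx)`, whereas `u` (like `A`, `D` and `S`) applies to the pattern as a whole. A minimal sketch, assuming a hypothetical helper `partition_modifiers`, of separating the two kinds so that the second kind can be passed as the global modifier:

```php
<?php
// Hypothetical helper: split a modifier string into the part that can be
// written inline as (?...) and the part that must stay on the whole pattern.
function partition_modifiers(string $modifiers): array
{
    $inline = '';
    $global = '';
    for ($i = 0, $n = strlen($modifiers); $i < $n; $i++) {
        $m = $modifiers[$i];
        // Conservative assumption: only i, m, s and x are treated as inline-safe.
        if (strpos('imsx', $m) !== false) {
            $inline .= $m;
        } else {
            $global .= $m;
        }
    }
    return [$inline, $global];
}

[$inline, $global] = partition_modifiers('iux');
// $inline === 'ix', $global === 'u'
```

Hoisting a modifier such as `u` to the merged pattern also changes how the other sub-expressions are interpreted, so this is only a reasonable trade-off when all inputs are meant to run in the same mode.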

score 3 · Accepted Answer

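Edit

I've rewritten the code! It now contains the changes listed below. Additionally, I've done extensive tests (which I won't post here because there are too many) to look for errors. So far, I haven't found any.

The function is now split into two parts: there's a separate function preg_strip, which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).
The code now handles back-references correctly. This was necessary for my purpose after all. It wasn't difficult to add, although the regular expression used to capture the back-references looks weird (and may actually be extremely inefficient; it looks NP-hard to me, but that's only an intuition and only applies to weird edge cases). By the way, does anyone know a better way of checking for an odd number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings rather than regular expressions. However, I need a regular expression here to test whether the preceding backslash is itself escaped.

Additionally, I don't know how good PHP is at caching anonymous create_function usage. Performance-wise, this might not be the best solution, but it seems good enough.
I've fixed a bug in the sanity check.
I've removed the cancellation of obsolete modifiers, since my tests show that it isn't necessary.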
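
On the question of detecting an odd number of backslashes: one alternative to a lookbehind is to let the regex consume every escape pair as a unit, so an escaped backslash can never be mistaken for the start of a back-reference. The following is a sketch rather than a drop-in replacement. `shift_backrefs` is a hypothetical name, it uses a closure instead of create_function, and it assumes that a backslash followed by digits is always a back-reference (so octal escapes, character classes and `\g{n}` forms are not handled).

```php
<?php
// Hypothetical helper: add $offset to every numbered back-reference in $body.
function shift_backrefs(string $body, int $offset): string
{
    return preg_replace_callback(
        // Either "\<digits>" (a back-reference) or any other escape pair.
        // Matching pairs left to right means "\\1" is read as an escaped
        // backslash followed by a literal "1", not as a back-reference.
        '/\\\\(\d+)|\\\\./s',
        function (array $m) use ($offset): string {
            if (isset($m[1]) && $m[1] !== '') {
                return '\\' . ((int) $m[1] + $offset);
            }
            return $m[0]; // leave other escapes untouched
        },
        $body
    );
}

$second = shift_backrefs('(.)y\1', 1);
// $second === '(.)y\2'
```

With one capture group in the preceding expression, the offset is 1, which turns /(.)y\1/ into the second half of /(.)x\1|(.)y\2/.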

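By the way, this code is one of the core components of a syntax highlighter for various languages that I'm working on in PHP, since I'm not satisfied with the alternatives listed elsewhere.

Thanks!

porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.

I've built upon your solution and I'd like to share it here. ~~I didn't implement re-numbering of back-references since this isn't relevant in my case (I think…). Perhaps this will become necessary later, though.~~

Some questions…
One thing, @eyelidlessness: why do you feel the need to cancel old modifiers? As far as I can see, this isn't necessary, since the modifiers are only applied locally anyway. Ah yes, one other thing: your escaping of the delimiter seems overly complicated. Care to explain why you think it's needed? I believe my version should work as well, but I could be very wrong.

Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.

BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/

The code

Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. ~~Extensive tests are yet to be done, though.~~ Please comment!

And without further ado…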

/**
 * Merges several regular expressions into one, using the indicated 'glue'.
 *
 * This function takes care of individual modifiers so it's safe to use
 * <em>different</em> modifiers on the individual expressions. The order of
 * sub-matches is preserved as well. Numbered back-references are adapted to
 * the new overall sub-match count. This means that it's safe to use numbered
 * back-references in the individual expressions!
 * If {@link $names} is given, the individual expressions are captured in
 * named sub-matches using the contents of that array as names.
 * Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
 * <strong>not</strong> supported.
 *
 * The function assumes that all regular expressions are well-formed.
 * Behaviour is undefined if they aren't.
 *
 * This function was created after a {@link https://stackoverflow.com/questions/244959/
 * StackOverflow discussion}. Much of it was written or thought of by
 * “porneL” and “eyelidlessness”. Many thanks to both of them.
 *
 * @param string $glue  A string to insert between the individual expressions.
 *      This should usually be either the empty string, indicating
 *      concatenation, or the pipe (<code>|</code>), indicating alternation.
 *      Notice that this string might have to be escaped since it is treated
 *      like a normal character in a regular expression (i.e. <code>/</code>
 *      will end the expression and result in an invalid output).
 * @param array $expressions    The expressions to merge. The expressions may
 *      have arbitrary different delimiters and modifiers.
 * @param array $names  Optional. This is either an empty array or an array of
 *      strings of the same length as {@link $expressions}. In that case,
 *      the strings of this array are used to create named sub-matches for the
 *      expressions.
 * @return string A string representing a regular expression equivalent to the
 *      merged expressions. Returns <code>FALSE</code> if an error occurred.
 */
function preg_merge($glue, array $expressions, array $names = array()) {
    // … then, a miracle occurs.

    // Sanity check …

    $use_names = ($names !== null and count($names) !== 0);

    if (
        $use_names and count($names) !== count($expressions) or
        !is_string($glue)
    )
        return false;

    $result = array();
    // For keeping track of the names for sub-matches.
    $names_count = 0;
    // For keeping track of *all* captures to re-adjust backreferences.
    $capture_count = 0;

    foreach ($expressions as $expression) {
        if ($use_names)
            $name = str_replace(' ', '_', $names[$names_count++]);

        // Get delimiters and modifiers:

        $stripped = preg_strip($expression);

        if ($stripped === false)
            return false;

        list($sub_expr, $modifiers) = $stripped;

        // Re-adjust backreferences:

        // We assume that the expression is correct and therefore don't check
        // for matching parentheses.

        $number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);

        if ($number_of_captures === false)
            return false;

        if ($number_of_captures > 0) {
            // NB: This looks NP-hard. Consider replacing.
            $backref_expr = '/
                (                # Only match when not escaped:
                    [^\\\\]      # guarantee an even number of backslashes
                    (\\\\*?)\\2  # (twice n, preceded by something else).
                )
                \\\\ (\d)        # Backslash followed by a digit.
            /x';
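            // NB: create_function() was deprecated in PHP 7.2 and removed in
            // PHP 8.0; on current versions a closure is needed here instead.
            $sub_expr = preg_replace_callback(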
                $backref_expr,
                create_function(
                    '$m',
                    'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'
                ),
                $sub_expr
            );
            $capture_count += $number_of_captures;
        }

        // Last, construct the new sub-match:

        $modifiers = implode('', $modifiers);
        $sub_modifiers = "(?$modifiers)";
        if ($sub_modifiers === '(?)')
            $sub_modifiers = '';

        $sub_name = $use_names ? "?<$name>" : '?:';
        $new_expr = "($sub_name$sub_modifiers$sub_expr)";
        $result[] = $new_expr;
    }

    return '/' . implode($glue, $result) . '/';
}

/**
 * Strips a regular expression string of its delimiters and modifiers.
 * Additionally, normalize the delimiters (i.e. reformat the pattern so that
 * it could have used '/' as delimiter).
 *
 * @param string $expression The regular expression string to strip.
 * @return array An array whose first entry is the expression itself, the
 *      second an array of modifiers. If the argument is not a valid regular
 *      expression, returns <code>FALSE</code>.
 *
 */
function preg_strip($expression) {
    if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
        return false;

    $delim = $matches[1];
    $sub_expr = $matches[2];
    if ($delim !== '/') {
        // Replace occurrences of the escaped delimiter with its unescaped
        // version and escape the new delimiter.
        $sub_expr = str_replace("\\$delim", $delim, $sub_expr);
        $sub_expr = str_replace('/', '\\/', $sub_expr);
    }
    $modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));

    return array($sub_expr, $modifiers);
}

PS: I've made this post community wiki, so it's editable. You know what this means…!

score 1 · Accepted Answer

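I'm pretty sure it's not possible to just put regexes together like that in any language; they could have incompatible modifiers.

I'd probably just put them in an array and loop through them, or combine them by hand.

Edit: If you're doing them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help things.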

score 0 · Accepted Answer

function preg_magic_coalesce($split, $re1, $re2) {
  $re1 = rtrim($re1, "\/#is");
  $re2 = ltrim($re2, "\/#");
  return $re1.$split.$re2;
}

score 0 · Accepted Answer

You could do it like this:

$a = '# /[a-z] #i';
$b = '/ Moo /x';

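// $text is the subject string being searched.
// PREG_OFFSET_CAPTURE makes $matches[0] a pair: array(matched text, byte offset).
$a_matched = preg_match($a, $text, $a_matches, PREG_OFFSET_CAPTURE);
$b_matched = preg_match($b, $text, $b_matches, PREG_OFFSET_CAPTURE);

if ($a_matched && $b_matched) {
    // Index 0 is the whole match; the patterns define no capture groups,
    // so index 1 would be undefined.
    list($a_text, $a_pos) = $a_matches[0];
    list($b_text, $b_pos) = $b_matches[0];

    if ($a_pos == $b_pos) {
        if (strlen($a_text) == strlen($b_text)) {
            // $a and $b matched the exact same string
        } else if (strlen($a_text) > strlen($b_text)) {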
            // $a and $b started matching at the same spot but $a is longer
        } else {
            // $a and $b started matching at the same spot but $b is longer
        }
    } else if ($a_pos < $b_pos) {
        // $a matched first
    } else {
        // $b matched first
    }
} else if ($a_matched) {
    // $a matched, $b didn't
} else if ($b_matched) {
    // $b matched, $a didn't
} else {
    // neither one matched
}
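
The same comparison can be generalised to any number of patterns. The sketch below assumes a hypothetical helper `earliest_match` that returns the key, byte offset and length of the winning match, preferring the earliest start and, on a tie, the longest match. It still runs every pattern once over the text, so it illustrates the bookkeeping rather than a speed-up.

```php
<?php
// Hypothetical helper: find which pattern matches earliest in $text,
// breaking ties at the same offset by preferring the longer match.
function earliest_match(array $patterns, string $text): ?array
{
    $best = null;
    foreach ($patterns as $key => $pattern) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue; // no match (or an error) for this pattern
        }
        // $m[0] is [matched text, byte offset] for the whole match.
        [$matched, $pos] = $m[0];
        $len = strlen($matched);
        if ($best === null
            || $pos < $best['pos']
            || ($pos === $best['pos'] && $len > $best['len'])
        ) {
            $best = ['key' => $key, 'pos' => $pos, 'len' => $len, 'text' => $matched];
        }
    }
    return $best; // null if nothing matched
}

$winner = earliest_match(['# /[a-z] #i', '/ Moo /x'], 'say Moo /x ');
// $winner['key'] === 1 and $winner['pos'] === 4: "Moo" starts before " /x "
```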
