php - Help with regular expressions

Question

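I'm ashamed to admit it, but some aspects of regular expressions are still unclear to me. I need to parse a text file that contains many string literals in the format @"I'm a string". I wrote the simple pattern /@"([^"]*)"/si. It works perfectly, and preg_match_all returns the whole collection. But obviously it doesn't work correctly if a string literal contains escaped quotes, as in @"I'm plain string. I'm \"qouted\" string ". Any clue would be appreciated.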
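To make the failure concrete, here is a minimal sketch that runs the pattern from the question against a literal with escaped quotes. The sample text in $subject is made up for illustration.

```php
<?php
// Hypothetical sample input: one plain literal and one with escaped quotes.
// A nowdoc keeps the backslashes exactly as typed.
$subject = <<<'TEXT'
@"I'm a string" @"I'm plain string. I'm \"qouted\" string "
TEXT;

// The pattern from the question: [^"]* stops at the first double quote,
// whether or not a backslash precedes it.
preg_match_all('/@"([^"]*)"/si', $subject, $matches);

print_r($matches[1]);
// The second capture is cut off at the first escaped quote,
// so it ends with the backslash in front of "qouted".
```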

score 2 · Accepted Answer

This is a use case for Friedl's classic "unrolled loop" (edit: fixed the grouping for capture):

/"((?:[^"\\]|\\.)*)"/

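This will match a quoted string while taking backslash-escaped quotes into account.

The full regular expression you would use to match a field (including the @) would be: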

/@"((?:[^"\\]|\\.)*)"/

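Be careful, though! I often see people complain that this pattern doesn't work in PHP. That's because using backslashes inside strings is a bit of a mind-bender.

The backslashes in the pattern above are literal backslashes that need to reach PCRE. This means they have to be double-escaped when you write them in a PHP string: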

$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($expr, $subject, $matches);

print_r($matches[1]); // this will show the content of all the matched fields
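As a runnable sketch of the snippet above, the following prints the pattern string as PCRE receives it and then applies it. The contents of $subject are an assumption for illustration.

```php
<?php
// Hypothetical sample input (nowdoc, so backslashes are kept exactly as typed).
$subject = <<<'TEXT'
@"I'm a string" @"I'm plain string. I'm \"qouted\" string "
TEXT;

// Four backslashes in the single-quoted PHP literal become two in the
// string handed to PCRE, i.e. one escaped literal backslash in the regex.
$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';
echo $expr, "\n"; // prints /@"((?:[^"\\]|\\.)*)"/

preg_match_all($expr, $subject, $matches);
print_r($matches[1]); // both literals, the second one with its \" sequences intact
```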


How does it work?

...I hear you ask. Well, let's see if I can explain this in a way that actually makes sense. Let's enable x mode so we can space it out a bit:

/
  @             # literal @
  "             # literal "
    (           # start capture group, we want everything between the quotes
      (?:       # start non-capturing group (a group we can safely repeat)
        [^"\\]  # match any character that's not a " or a \
        |       # ...or...
        \\.     # a literal \ followed by any character
      )*        # close non-capturing group and allow zero or more occurrences
    )           # close the capture group
  "             # literal "
/x
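If you want to run the spaced-out version as is, one option (not part of the answer as originally written) is to put it in a nowdoc. PHP does no escape processing inside a nowdoc, so the backslashes can be written exactly as PCRE should see them. The sample subject below is made up, and the comments avoid the / delimiter on purpose because PHP would treat it as the end of the pattern.

```php
<?php
// Nowdoc: the text between the markers is taken literally, no double escaping.
$expr = <<<'REGEX'
/
  @             # literal @
  "             # literal "
  (             # capture everything between the quotes
    (?:         # non-capturing group that is safe to repeat
      [^"\\]    # any character that is not a quote or a backslash
      |         # ...or...
      \\.       # a backslash followed by any character
    )*          # zero or more occurrences
  )             # close the capture group
  "             # literal "
/x
REGEX;

// Hypothetical sample input; single quotes keep \" as backslash + quote.
$subject = '@"a \"b\" c" and @"d"';

preg_match_all($expr, $subject, $matches);
print_r($matches[1]);
```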

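The important points here are:

[^"\\]|\\. - means that every backslash is "balanced": each backslash must escape one character, and no character is considered more than once.
Wrapping the above in a group repeated with * means the pattern can occur any number of times, and that empty strings are allowed (if you don't want to allow empty strings, change the * to a +). This is the "loop" part of "unrolled loop".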
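For comparison, the form that usually goes by the name "unrolled loop" is written normal* (special normal*)*, and the alternation above is its compact equivalent. The sketch below, with a made-up $subject and nowdoc patterns to avoid PHP-level escaping, checks that both forms capture the same contents, including an empty literal and one that ends in an escaped backslash.

```php
<?php
// Hypothetical sample input: a plain literal, an empty one, and one with escapes.
$subject = <<<'TEXT'
@"I'm a string" @"" @"I'm \"qouted\" and end with a backslash \\"
TEXT;

// Alternation form from the answer.
$alternation = <<<'REGEX'
/@"((?:[^"\\]|\\.)*)"/
REGEX;

// Unrolled form: normal* (special normal*)*
//   normal  = [^"\\]  anything but a quote or a backslash
//   special = \\.     a backslash plus the character it escapes
$unrolled = <<<'REGEX'
/@"([^"\\]*(?:\\.[^"\\]*)*)"/
REGEX;

preg_match_all($alternation, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[1] === $b[1]); // both forms capture the same contents
print_r($b[1]);
```

The unrolled form tends to do less backtracking on long inputs, at the cost of being harder to read.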

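But the output string still contains the backslashes from the escaped quotes?

Indeed it does. This is only a matching procedure; it does not modify the match. But because the result is the content of the string, a simple str_replace('\\"', '"', $result) would be safe and produce the correct result.

However, when doing this sort of thing I often find that I want to handle other escape sequences as well. In that case I usually do something like this to the result: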

// '/\\\\./' in PHP source is the regex \\. (a literal backslash followed by any character)
$result = preg_replace_callback('/\\\\./', function($match) {
    switch ($match[0][1]) { // inspect the escaped character
        case 'r':
            return "\r";

        case 'n':
            return "\n";

        case 't':
            return "\t";

        case '\\':
            return '\\';

        case '"':
            return '"';

        default: // if it's not a valid escape sequence, treat the \ as literal
            return $match[0];
    }
}, $result);

This gives behaviour similar to double-quoted strings in PHP, where \t is replaced with a tab, \n with a newline, and so on.
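Putting the two steps together, here is a hedged end-to-end sketch: match the literals, then decode the escape sequences in each capture. The helper name unescape_literal and the sample input are assumptions for illustration, and a lookup table stands in for the switch.

```php
<?php
// Hypothetical helper: decodes backslash escapes in one captured literal.
function unescape_literal(string $raw): string
{
    // '/\\\\./' is the PHP spelling of the regex \\. (backslash + any character).
    return preg_replace_callback('/\\\\./', function (array $match): string {
        $map = ['r' => "\r", 'n' => "\n", 't' => "\t", '\\' => '\\', '"' => '"'];
        $char = $match[0][1]; // the escaped character
        // Unknown escape sequences are left untouched.
        return $map[$char] ?? $match[0];
    }, $raw);
}

// Hypothetical sample input (nowdoc keeps the backslashes as typed).
$subject = <<<'TEXT'
@"tab:\there" @"say \"hi\"" @"keep \z"
TEXT;

preg_match_all('/@"((?:[^"\\\\]|\\\\.)*)"/', $subject, $matches);
$decoded = array_map('unescape_literal', $matches[1]);

print_r($decoded); // a real tab in the first item, plain quotes in the second, \z untouched
```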

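What if I also want to allow single-quoted strings?

This bugged me for a long time. I always had a niggling feeling that it could be handled more efficiently with a backreference, but many attempts failed to produce anything workable.

This is what I do: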

/(?:"((?:[^"\\]|\\.)*)")|(?:'((?:[^'\\]|\\.)*)')/

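As you can see, this is basically just the same pattern applied twice with an OR between the two copies. It also complicates the string extraction on the PHP side: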

$expr = '/(?:"((?:[^"\\\\]|\\\\.)*)")|(?:\'((?:[^\'\\\\]|\\\\.)*)\')/';

preg_match_all($expr, $subject, $matches);

$result = array();
for ($i = 0; isset($matches[0][$i]); $i++) {
    if ($matches[1][$i] !== '') {
        $result[] = $matches[1][$i];
    } else {
        $result[] = $matches[2][$i];
    }
}

print_r($result);
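One possible way to avoid the two-group bookkeeping, which is not part of the answer above, is PCRE's branch reset group (?|...). Inside it, every alternative numbers its capture groups from the same starting index, so both quote styles land in group 1. A minimal sketch with a made-up subject:

```php
<?php
// Hypothetical sample input with double-quoted, single-quoted and empty strings.
$subject = <<<'TEXT'
say "a \"b\"" or 'c \'d\'' and ""
TEXT;

// (?| ... | ... ) is a branch reset group: both alternatives capture into group 1.
// The nowdoc means the backslashes need no PHP-level escaping.
$expr = <<<'REGEX'
/(?|"((?:[^"\\]|\\.)*)"|'((?:[^'\\]|\\.)*)')/
REGEX;

preg_match_all($expr, $subject, $matches);
print_r($matches[1]); // contents of all three strings, in order
```

The trade-off is that you can no longer tell from the group number which quote style matched; if that matters, inspect the first character of $matches[0][$i].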

score 0 · Accepted Answer

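You need to use a negative lookbehind: match everything up to a quote that is not preceded by a backslash. Here it is in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String[] strings = new String[]{"@\"I'm a string\"", "@\"I'm plain string. I'm \\\"qouted\\\" \""};

        // The lookbehind sits directly before the closing quote, and the quantifier
        // is lazy so that each literal ends at its own closing quote.
        // Caveat: a literal ending in an escaped backslash is not handled.
        final Pattern p = Pattern.compile("@\"(.*?)(?<!\\\\)\"");
        System.out.println(p.pattern());

        for (final String string : strings) {
            final Matcher matcher = p.matcher(string);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}

Output:

@"(.*?)(?<!\\)"
I'm a string
I'm plain string. I'm \"qouted\" 

The pattern (without all the Java escaping) is: @"(.*?)(?<!\\)"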
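A lookbehind that inspects only the single character before the quote cannot tell an escaping backslash from an escaped one. The PHP sketch below uses a made-up input whose first literal ends in an escaped backslash, and compares a lookbehind pattern with the alternation pattern from the accepted answer. Both patterns are written as nowdocs so the backslashes appear as PCRE sees them.

```php
<?php
// Hypothetical input: the first literal ends with an escaped backslash.
$subject = <<<'TEXT'
@"C:\\" and @"b"
TEXT;

// Lookbehind version: stop at the first quote not preceded by a backslash.
$lookbehind = <<<'REGEX'
/@"(.*?)(?<!\\)"/
REGEX;

// Alternation version: consume each backslash together with the character it escapes.
$alternation = <<<'REGEX'
/@"((?:[^"\\]|\\.)*)"/
REGEX;

preg_match_all($lookbehind, $subject, $a);
preg_match_all($alternation, $subject, $b);

print_r($a[1]); // a single match that runs past the first closing quote
print_r($b[1]); // two matches, one per literal
```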
