regex - Perl: strange behavior of dynamically generated regex strings containing backslash metacharacters

Question

Here is a small Perl snippet:

my $n = 1;
my $re_str = '\d';
my $re_str_q = '\Q1\E';

printf "re_str   match: %s\n", ($n =~ /$re_str/);
printf "re_str_q match: %s\n", ($n =~ /$re_str_q/);
printf "direct match: %s\n", ($n =~ /\Q1\E/);

Running it produces the following output:

re_str   match: 1
re_str_q match: 
direct match: 1

So my question is: why doesn't the second printf match?

score 9 · Accepted Answer

If you change

my $re_str_q = '\Q1\E'; #from 
my $re_str_q = qr/\Q1\E/; #to

which is the correct way to pass around a dynamically generated regex, you get the following result:

re_str   match: 1
re_str_q match: 1
direct match: 1

Also, if you had used

use strict;
use warnings;

you would have received these warnings:

Unrecognized escape \Q passed through in regex; marked by <-- HERE in m/\Q <-- HERE 1\E/ at so.pl line 9.
Unrecognized escape \E passed through in regex; marked by <-- HERE in m/\Q1\E <-- HERE / at so.pl line 9.

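That would have given you some indication of what went wrong.

Update

For a more detailed understanding, you can read about it here.

Taken from the reference documentation:

The following escape sequences are available in constructs that interpolate, but not in transliterations.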

\l  lowercase next character only
\u  titlecase (not uppercase!) next character only
\L  lowercase all characters till \E or end of string
\U  uppercase all characters till \E or end of string
\F  foldcase all characters till \E or end of string
\Q  quote (disable) pattern metacharacters till \E or
    end of string
\E  end either case modification or quoted section
    (whichever was last seen)

See quotemeta for the exact definition of characters that are quoted by \Q.

\L, \U, \F, and \Q can stack, in which case you need one \E for each. For example:

say "This \Qquoting \ubusiness \Uhere isn't quite\E done yet,\E is it?";
This quoting\ Business\ HERE\ ISN\'T\ QUITE\ done\ yet\, is it?

score 3 · Accepted Answer

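You are using single quotes to build the "dynamically generated regex". With use warnings, perl will tell you:

Unrecognized escape \Q passed through in regex; marked by <-- HERE in m/\Q <-- HERE 1\E/ at ...

perldoc perlop will tell you:

Single Quotes

Single quotes indicate the text is to be treated literally with no interpolation of its content. This is similar to single quoted strings except that backslashes have no special meaning, with \\ being treated as two backslashes and not one as they would in every other quoting construct.

This is the only form of quoting in perl where there is no need to worry about escaping content, something that code generators can and do make good use of.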

score 3 · Accepted Answer

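\Q is not a regex escape. It is a string escape that is processed when a string is interpolated, so "\Q1\E" is equivalent to quotemeta('1').

So you need to use a quoting construct that interpolates these sequences, such as "" or qr//, or call quotemeta instead of trying to use string escapes.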
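The equivalence described above can be checked with a minimal sketch. The variable names and the sample input `1.5+` are made up for illustration and do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical input: literal text that contains regex metacharacters.
my $literal = '1.5+';

# quotemeta() backslash-escapes every ASCII character outside [A-Za-z0-9_].
my $quoted = quotemeta($literal);

# "\Q...\E" inside an interpolating string performs the same quoting.
my $interp = "\Q$literal\E";

print "quotemeta:   $quoted\n";
print "interpolate: $interp\n";
print "identical:   ", ($quoted eq $interp ? "yes" : "no"), "\n";

# The quoted pattern matches only the literal text; the raw one treats
# '.' and '+' as metacharacters and so also matches other input.
my $other = '1x55';
print "quoted matches '$other': ", ($other =~ /$quoted/  ? "yes" : "no"), "\n";
print "raw matches '$other':    ", ($other =~ /$literal/ ? "yes" : "no"), "\n";
```

Calling quotemeta directly is usually the clearer choice when the text to match arrives at run time, since it does not depend on which quoting construct the surrounding code happens to use.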

score 2 · Accepted Answer

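Both \Q and \E are handled by string interpolation, not by the regex engine. You are bypassing string interpolation in your first and second printf lines. Search Programming Perl for "seven translation escapes" for a discussion of this. They are: \N{...} \U \u \L \l \E \Q \F. (Don't ask me why it says seven when there are eight.)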

score 2 · Accepted Answer

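Two kinds of escapes are involved in regular expressions.

Escapes processed when the literal is parsed.
Escapes processed by the regex engine.

For example,

\Q..\E is of the former kind.
\d is of the latter kind.
\n is of both kinds.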
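A small sketch of the two kinds of escapes listed above. The variable names are illustrative only. It shows that \n works whichever stage handles it, while \d is meaningful only to the regex engine.

```perl
use strict;
use warnings;

# "\n" is processed when the literal is parsed: one newline character.
my $parsed = "\n";

# '\n' is left untouched by the parser: a backslash followed by 'n'.
my $raw = '\n';

# '\d' means nothing to the string parser; the regex engine interprets it.
my $digit = '\d';

print "length of parsed: ", length($parsed), "\n";
print "length of raw:    ", length($raw), "\n";

# Both forms match a newline, because the regex engine also understands \n.
my $text = "a\nb";
print "parsed matches: ", ($text =~ /$parsed/ ? "yes" : "no"), "\n";
print "raw matches:    ", ($text =~ /$raw/    ? "yes" : "no"), "\n";

# \d reaches the regex engine intact from a single-quoted string.
print "digit matches:  ", ('5' =~ /$digit/ ? "yes" : "no"), "\n";
```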

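This means that

"abc\Q!@#\Edef" produces the 12-character string abc\!\@\#def.
qq/abc\Q!@#\Edef/ does exactly the same.
qr/abc\Q!@#\Edef/ produces the 12-character string abc\!\@\#def, which is then compiled into a regex pattern that matches the 9 characters abc!@#def.

But single quotes do not process this kind of escape, so

'abc\Q!@#\Edef' produces the 13-character string abc\Q!@#\Edef.

The regex engine does not understand \Q or \E, so if you end up passing that last string to it, it emits warnings and then tries to match the 11 characters abcQ!@#Edef.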
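The character counts above can be verified with a short sketch. The variable names are illustrative, and the regexp warnings are silenced locally only to keep the demonstration output clean.

```perl
use strict;
use warnings;

my $double = "abc\Q!@#\Edef";    # \Q..\E processed by the string parser
my $single = 'abc\Q!@#\Edef';    # \Q and \E kept as literal text
my $regex  = qr/abc\Q!@#\Edef/;  # quoted first, then compiled

print "double-quoted length: ", length($double), "\n";
print "single-quoted length: ", length($single), "\n";

# The compiled regex matches the plain 9-character text.
print "qr matches abc!\@#def: ",
    ('abc!@#def' =~ $regex ? "yes" : "no"), "\n";

# Passing the single-quoted string to the engine: \Q and \E degrade to
# plain Q and E (with warnings, suppressed here).
my $degraded;
{
    no warnings 'regexp';
    $degraded = ('abcQ!@#Edef' =~ /$single/) ? 1 : 0;
}
print "single-quoted pattern matches abcQ!\@#Edef: ",
    ($degraded ? "yes" : "no"), "\n";
```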

The fix is to change

my $re_str   = '\d';         # Produces string \d
my $re_str_q = '\Q1\E';      # Produces string \Q1\E

to

my $re_str   = "\\d";        # Produces string \d
my $re_str_q = "\Q1\E";      # Produces string 1 (digits are not quoted)

or better yet,

my $re_str   = qr/\d/;       # Produces regex \d
my $re_str_q = qr/\Q1\E/;    # Produces regex 1
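Putting the qr// version back into the snippet from the question gives roughly the following sketch. The `$needle` variable and the yes/no output format are additions for illustration and are not part of the original code.

```perl
use strict;
use warnings;

my $n = 1;

# Compiled regex objects: escapes are handled while the literal is parsed.
my $re_str   = qr/\d/;
my $re_str_q = qr/\Q1\E/;

# Hypothetical run-time value, quoted while building the pattern.
my $needle = '1';
my $re_dyn = qr/\Q$needle\E/;

printf "re_str   match: %s\n", ($n =~ $re_str   ? "yes" : "no");
printf "re_str_q match: %s\n", ($n =~ $re_str_q ? "yes" : "no");
printf "re_dyn   match: %s\n", ($n =~ $re_dyn   ? "yes" : "no");
```

Using an explicit ternary avoids handing printf an empty list when a match fails, which is what left the second line of the original output blank.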

Read more about \Q specifically.
