I have this regular expression:
("[^"]*")|('[^']*')|([^<>]+)
When it is given this input string:
<telerik:RadTab Text="RGB">
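I would like it to match "RGB". However, it doesn't, because the last alternative produces a longer string.
What I would ideally like is this:
- If there is a double-quoted substring, match it, including the double quotes.
- Otherwise, if there is a single-quoted substring, match it, including the single quotes.
- Otherwise, if there is a string enclosed in angle brackets, match it, excluding the angle brackets.
Can this logic be done in a single regular expression?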
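using System;
using System.Text.RegularExpressions;

var strings = new[]
{
    "<telerik:RadTab Text=\"RGB\">", "<telerik:RadTab Text=RGB>", "<telerik:RadTab Text='RGB'>"
};
// Alternatives: an unquoted <...> tag (group 1), a double-quoted string (group 2),
// or a single-quoted string (group 3).
var r = new Regex("<([^<\"']+[^>\"']+)>|(\"[^\"]*\")|('[^']*')");
foreach (var s1 in strings)
{
    Console.WriteLine(s1);
    var match = r.Match(s1);
    // match.Value includes the angle brackets when the first alternative matches;
    // match.Groups[1].Value holds the text without them.
    Console.WriteLine(match.Value);
    Console.WriteLine();
}
Console.ReadLine();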
One solution to this problem is to use lookahead assertions:
(?=("[^"]*"))|(?=('[^']*'))|(?=<([^<>]+)>)
Let's break the regex down for a better view:
(?=             # zero-width assertion, look ahead if there is ...
    ("[^"]*")   # a double quoted string, group it in group number 1
)               # end of lookahead
|               # or
(?=             # zero-width assertion, look ahead if there is ...
    ('[^']*')   # a single quoted string, group it in group number 2
)               # end of lookahead
|               # or
(?=             # zero-width assertion, look ahead if there is ...
    <([^<>]+)>  # match anything except <> between <> one or more times and group it in group number 3
)               # end of lookahead
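The commented layout can be handed to .NET almost as written. The following is a minimal sketch, assuming C# 9 top-level statements. `Describe` is a hypothetical helper introduced only to compare the two patterns, and the comments inside the pattern are shortened versions of the ones above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same three alternatives, written with layout and comments.
// RegexOptions.IgnorePatternWhitespace makes .NET skip unescaped whitespace
// and treat '#' up to the end of the line as a comment.
// In a C# verbatim string (@"..."), a literal double quote is written "".
var verbose = new Regex(@"
    (?= (""[^""]*"") )   # double-quoted string, captured in group 1
  |
    (?= ('[^']*') )      # single-quoted string, captured in group 2
  |
    (?= <([^<>]+)> )     # text between < and >, captured in group 3
", RegexOptions.IgnorePatternWhitespace);

// The compact one-line form of the same pattern.
var compact = new Regex("(?=(\"[^\"]*\"))|(?=('[^']*'))|(?=<([^<>]+)>)");

// Hypothetical helper: renders every match as "index:g1|g2|g3" so the
// two patterns can be compared on the same input.
string Describe(Regex regex, string text)
{
    var parts = new List<string>();
    foreach (Match m in regex.Matches(text))
    {
        parts.Add($"{m.Index}:{m.Groups[1].Value}|{m.Groups[2].Value}|{m.Groups[3].Value}");
    }
    return string.Join(";", parts);
}

const string input = "<telerik:RadTab Text=\"RGB\">";
Console.WriteLine(Describe(verbose, input));
Console.WriteLine(Describe(compact, input));
```

Both lines should be identical: one zero-width match at index 0 with group 3 filled, and one at index 21 with group 1 filled.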
You might be thinking, "What in the world is he doing?" No problem, I'll explain further why your regex fails.
We have the following string, <telerik:RadTab Text="RGB">:
<telerik:RadTab Text="RGB">
^ the regex engine starts here
  since there is no match with ("[^"]*")|('[^']*')|([^<>]+)
  it will look further!
<telerik:RadTab Text="RGB">
 ^ the regex engine will now take a look here
   it will check if there is "[^"]*", well obviously there isn't
   now since there is an alternation, the regex engine will
   check if there is '[^']*', meh same thing
   it will now check if there is [^<>]+, but hey it matches!
   So your regex engine will "eat" it like so
<telerik:RadTab Text="RGB">
 ^^^^^^^^^^^^^^^^^^^^^^^^^ and match this, by eating I mean it's advancing
Now the regex engine is at this point
<telerik:RadTab Text="RGB">
                          ^ and obviously, there is no match
The problem is, you want it to "step" back to match "RGB"
The regex engine won't go back for you :(
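The trace above can be reproduced in a few lines. This is only a sketch (C# 9 top-level statements assumed), using the pattern from the question and the same input string:

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question: three plain (consuming) alternatives.
var original = new Regex("(\"[^\"]*\")|('[^']*')|([^<>]+)");
const string input = "<telerik:RadTab Text=\"RGB\">";

// Regex.Match returns the leftmost match. Nothing matches at index 0 ('<'),
// and at index 1 only the third alternative succeeds.
Match first = original.Match(input);
Console.WriteLine($"index {first.Index}: {first.Value}");
Console.WriteLine($"group 3 matched: {first.Groups[3].Success}");

// The next search resumes after the consumed text, so the quoted part
// is never examined on its own.
Match second = first.NextMatch();
Console.WriteLine($"second match found: {second.Success}");
```

The first match starts at index 1 and swallows everything up to the closing `>`, quotes included, which leaves nothing for the quoted alternatives to find.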
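This is why we put the groups inside zero-width assertions: a lookahead does not eat anything (the engine does not advance), and a group used inside a lookahead still captures the text it matched.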
<telerik:RadTab Text="RGB">
^ So when it comes here, it will match it with (?=<([^<>]+)>)
  but it won't eat the whole matched string
  Now obviously, the regex needs to continue to look for other matches
  So it comes here:
<telerik:RadTab Text="RGB">
 ^ no match
<telerik:RadTab Text="RGB">
  ^ no match
.....
until
<telerik:RadTab Text="RGB">
                     ^ hey there is a match using (?=("[^"]*"))
                       it will then advance further
<telerik:RadTab Text="RGB">
                      ^ no match
                        .... until it reaches the end
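Because the lookahead version reports every candidate instead of a single winner, the priority order from the question still has to be applied to its results. One possible sketch, assuming C# 9 top-level statements; `PickByPriority` is a hypothetical helper, not part of any library:

```csharp
using System;
using System.Text.RegularExpressions;

var lookahead = new Regex("(?=(\"[^\"]*\"))|(?=('[^']*'))|(?=<([^<>]+)>)");

// Hypothetical helper: scans all zero-width matches and returns the first
// capture from the highest-priority group that matched anywhere
// (group 1, then group 2, then group 3), or an empty string if none did.
string PickByPriority(string text)
{
    MatchCollection matches = lookahead.Matches(text);
    for (int g = 1; g <= 3; g++)
    {
        foreach (Match m in matches)
        {
            // Each match has Length 0; the text lives in the capture group.
            if (m.Groups[g].Success)
            {
                return m.Groups[g].Value;
            }
        }
    }
    return string.Empty;
}

Console.WriteLine(PickByPriority("<telerik:RadTab Text=\"RGB\">"));
Console.WriteLine(PickByPriority("<telerik:RadTab Text='RGB'>"));
Console.WriteLine(PickByPriority("<telerik:RadTab Text=RGB>"));
```

The quoted inputs should yield the quoted value with its quotes, and the unquoted input should fall through to the text between the angle brackets.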
Of course, if you have a string like <telerik:RadTab Text="RGB'lol'">, it will still match 'lol' inside the double-quoted value and put it in group 2.
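To see which group each capture lands in for the string <telerik:RadTab Text="RGB'lol'">, something like the following should do. It is a sketch assuming C# 9 top-level statements, and `ListCaptures` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var lookahead = new Regex("(?=(\"[^\"]*\"))|(?=('[^']*'))|(?=<([^<>]+)>)");

// Hypothetical helper: one "index N, group G: text" line per successful group.
List<string> ListCaptures(string text)
{
    var lines = new List<string>();
    foreach (Match m in lookahead.Matches(text))
    {
        for (int g = 1; g <= 3; g++)
        {
            if (m.Groups[g].Success)
            {
                lines.Add($"index {m.Index}, group {g}: {m.Groups[g].Value}");
            }
        }
    }
    return lines;
}

foreach (string line in ListCaptures("<telerik:RadTab Text=\"RGB'lol'\">"))
{
    Console.WriteLine(line);
}
```

Three captures are reported: the tag content in group 3, the whole double-quoted value in group 1, and the nested 'lol' in group 2.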
Regex rocks!
Edit: Consider the following regex...
(\".*?\"|\'.*?\'|(?<=\<).*?(?=\>))