好的,让我们看看这里发生了什么。幸运的是,您已经将问题简化为应用正则表达式时发生的情况
(d+)*@
到字符串
ddddd
这显然无法匹配,因为@
缺少 。但是是什么让正则表达式引擎无法快速解决这个问题?
Well, (d+)*
in essence can be fulfilled by the following matches (each group separated by spaces):
ddddd
dddd d
ddd dd
dd ddd
d dddd
ddd d d
dd d dd
d d ddd
dd dd d
d dd dd
d ddd d
d d d dd
d d dd d
d dd d d
dd d d d
d d d d d
So you have one way of matching the string without breaking up the string, four ways to break it up into two strings, six ways to break it up into three, four to break it into four, and one to break it up into five strings. All these combinations have to be checked by the regex engine before it can declare failure to match when it finally has to face the following @
.
Why doesn't it figure that out sooner? Well, some regex engines can probably do it with such a simplified example. I bet Larry Wall has that covered. But your actual regex is a bit more complex, so I guess that would be much harder to figure out beforehand. Plus, there is a simple way to ensure all this combination-trying doesn't happen. But I'll come back to that later.
I had been wondering how many such combinations there would be for a longer string than those puny five d
s:
Let's take a string of length m
(which can be cut apart in m-1
different positions). Let's say n = m-1
. Then you can calculate the number of combinations as follows:
1 + (n!/(1! * (n-1)!)) + (n!/(2! * (n-2)!)) + ... + (n!/(n! * (n-n)!))
The first and last items are always 1, but the items in-between can get quite large. Let's write a small Python program:
>>> from math import factorial as f
>>> def comb(n,x):
... return f(n) // (f(x) * f(n-x))
...
>>> def ways_to_split(len):
... return 1 + sum(comb(len-1,x) for x in range(1,len))
...
>>> ways_to_split(5)
16
OK, seems to work. Let's try something bigger:
>>> ways_to_split(10)
512
>>> ways_to_split(40)
549755813888
>>> ways_to_split(100)
633825300114114700748351602688
Hey, there's a pattern here: ways_to_split(n)
is equal to 2**(n-1)
. See Mathematics SE for a proof. Anyway. Your regex has O(2^n)
complexity. Well, now you see why that might take a while...
Thankfully, many regex engines provide protection against this: possessive quantifiers or atomic capturing groups.
(d++)*
or
(?>d+)*
both ensure that whatever d+
has matched will not be relinquished again for trying other combinations. Good news, right?
Well, not if you use JavaScript. It doesn't support either of those features.
Therefore, you either need to rewrite your regex not to be susceptible to backtracking like this:
Instead of
(([_\.\-]?[a-zA-Z0-9]+)*)
use
[a-zA-Z0-9]+([_.-][a-zA-Z0-9]+)*
Or stop trying to use a regex to validate an e-mail address which doesn't work reliably, ever, anyway.