6

我有以下正则表达式来验证电子邮件地址:

^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$

在基本电子邮件上运行它可以很好地工作:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dave@the-taylors.org');

但是在长字符串上运行它会使 Chrome 崩溃:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dddddddddddddddddddddddddddddddddddddddd');

我注意到它大约有 40 个字符

这么密集的正则表达式是怎么回事?

4

4 回答 4

7

好的,让我们看看这里发生了什么。幸运的是,您已经将问题简化为应用正则表达式时发生的情况

(d+)*@

到字符串

ddddd

这显然无法匹配,因为@缺少 。但是是什么让正则表达式引擎无法快速解决这个问题?

Well, (d+)* in essence can be fulfilled by the following matches (each group separated by spaces):

ddddd
dddd d
ddd dd
dd ddd
d dddd
ddd d d
dd d dd
d d ddd
dd dd d
d dd dd
d ddd d
d d d dd
d d dd d
d dd d d
dd d d d
d d d d d

So you have one way of matching the string without breaking up the string, four ways to break it up into two strings, six ways to break it up into three, four to break it into four, and one to break it up into five strings. All these combinations have to be checked by the regex engine before it can declare failure to match when it finally has to face the following @.

Why doesn't it figure that out sooner? Well, some regex engines can probably do it with such a simplified example. I bet Larry Wall has that covered. But your actual regex is a bit more complex, so I guess that would be much harder to figure out beforehand. Plus, there is a simple way to ensure all this combination-trying doesn't happen. But I'll come back to that later.

I had been wondering how many such combinations there would be for a longer string than those puny five ds:

Let's take a string of length m (which can be cut apart in m-1 different positions). Let's say n = m-1. Then you can calculate the number of combinations as follows:

1 + (n!/(1! * (n-1)!)) + (n!/(2! * (n-2)!)) + ... + (n!/(n! * (n-n)!))

The first and last items are always 1, but the items in-between can get quite large. Let's write a small Python program:

>>> from math import factorial as f
>>> def comb(n,x):
...    return f(n) // (f(x) * f(n-x))
...
>>> def ways_to_split(len):
...    return 1 + sum(comb(len-1,x) for x in range(1,len))
...
>>> ways_to_split(5)
16

OK, seems to work. Let's try something bigger:

>>> ways_to_split(10)
512
>>> ways_to_split(40)
549755813888
>>> ways_to_split(100)
633825300114114700748351602688

Hey, there's a pattern here: ways_to_split(n) is equal to 2**(n-1). See Mathematics SE for a proof. Anyway. Your regex has O(2^n) complexity. Well, now you see why that might take a while...

Thankfully, many regex engines provide protection against this: possessive quantifiers or atomic capturing groups.

(d++)*

or

(?>d+)*

both ensure that whatever d+ has matched will not be relinquished again for trying other combinations. Good news, right?

Well, not if you use JavaScript. It doesn't support either of those features.

Therefore, you either need to rewrite your regex not to be susceptible to backtracking like this:

Instead of

(([_\.\-]?[a-zA-Z0-9]+)*)

use

[a-zA-Z0-9]+([_.-][a-zA-Z0-9]+)*

Or stop trying to use a regex to validate an e-mail address which doesn't work reliably, ever, anyway.

于 2012-10-09T22:00:27.650 回答
2

我认为这与您的正则表达式有关,而不是与字符串的长度有关。我使用以下正则表达式进行电子邮件验证,它没有在 Chrome 上崩溃..

/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/.test('dddddddddddddddddddddddddddddddddddddddd');
于 2012-10-09T16:05:06.247 回答
2

不要使用正则表达式验证电子邮件。我认为这是大约二十年来的常识。它太复杂了。大部分完整验证的示例在这里http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html甚至没有完全实现标准。编写一个功能相同的函数要简单得多。当您可以单独验证电子邮件的某些部分时,它变得微不足道。此外,就像您的情况一样,该功能会更快地执行此操作。

于 2012-10-09T17:59:08.210 回答
2

The root of the problem is here:

[_.-]?

If the first [A-Za-z0-9]+ (you left out the +, by the way) runs out of alphanumeric characters to consume, and the next character is not one of the separator characters ( [_.-] ), the match attempt should fail immediately.

What happens with your regex is that the first [A-Za-z0-9]+ starts backing off, giving up the characters it just matched one by one, and letting the second [A-Za-z0-9]+ (inside the * loop) try matching them instead. That's a lot of combinations it has to try (as Tim's thesis answer explains ☺), and it's all perfectly pointless.

Now have a look at this:

^[A-Za-z0-9]+([_.-][a-zA-Z0-9]+)*@[A-Za-z0-9]+([.-][a-zA-Z0-9]+)*\.[A-Za-z]{2,}$

This regex doesn't enter the * loop unless it actually sees one of the separator characters. The subexpression inside the * may be optional, but whatever it does match must start with _, ., or -. Similarly, if the regex succeeds in matching the local part and the next character is not @, it bails out immediately, where yours goes into another paroxysm of backtracking.

According to RegexBuddy, your regex takes 57 steps to match dave@the-taylors.org, where mine does it in 17 steps. And where yours locks up on the other string, mine reports a failed match in 5 steps.

The moral is: never use a ? or * quantifier on something that isn't really optional.


Disclaimer: I'm not endorsing the use of this or any other regex to match email addresses. I'm just explaining what's wrong with the regex in the question.

于 2012-10-10T04:18:03.560 回答