I'm curious about how a regex decides which characters to include when a character class uses a - between two characters.

Example: [a-zA-Z0-9]

This matches any letter a through z, in either case, and any digit 0 through 9.

I had originally thought that ranges were used like macros, so that a-z simply expands to a, b, c, d, e and so on. Then I saw the following in an open source project:

text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')

This changed my understanding of regexes entirely, because these are not typical characters. So how does this work correctly?

My theory is that the - literally means

Any ASCII value between the left character, and the right character. (e.g. a-z [97-122])

Could anybody confirm whether my theory is correct? Does the regex pattern in fact work from the character codes, covering everything between the two endpoints?

Furthermore, if it is correct, could you write a regex range like

A-z

Because A is 65 and z is 122, it should theoretically also match all characters between those values.

2 Answers

Both of your assumptions are correct. Technically, you could therefore write [#-~] and it would still be valid, matching uppercase letters, lowercase letters, digits, and certain symbols.

ASCII Table

You can also do this with Unicode, like [\u0000-\u1000].

You should not do [A-z], however, because there are some characters between the uppercase and lowercase letters (specifically [, \, ], ^, _, `), and the range would match those as well.
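As a quick check of the point about [A-z], here is a minimal Python sketch. It assumes Python's re module, which is not mentioned in the question, and the variable names are only illustrative. It lists the characters whose codes fall between Z and a, then tests each one against both patterns.

```python
import re

# Code points 91..96 sit between 'Z' (90) and 'a' (97) in ASCII.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]

wide = re.compile(r"[A-z]")       # one range, code points 65..122
narrow = re.compile(r"[A-Za-z]")  # two ranges, letters only

for ch in between:
    # Print the character, then whether each pattern matches it.
    print(repr(ch), bool(wide.fullmatch(ch)), bool(narrow.fullmatch(ch)))
```

Each of the six characters should match the wide pattern and not the narrow one, which is why the two separate ranges are the safer choice.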

Answered 2013-06-21T20:02:30.007

From MSDN - Character Classes in Regular Expressions (bold is mine):

The syntax for specifying a range of characters is as follows:

[firstCharacter-lastCharacter]

where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.

So your assumption is correct, but the effect is in fact wider: a range covers Unicode code points, not just ASCII values.
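To see why the tr call from the question can work with circled characters, here is a rough Python sketch of range expansion by code point. The expand helper is hypothetical and deliberately simplified (no escapes, no negation); it is not how Ruby implements tr, only an illustration of the idea that a range means every code point from the left character to the right one.

```python
import re

def expand(spec: str) -> str:
    """Expand tr-style ranges such as 'A-Z' into every character in between."""
    out = []
    i = 0
    while i < len(spec):
        # A range is: start character, '-', end character.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            start, end = ord(spec[i]), ord(spec[i + 2])
            out.extend(chr(c) for c in range(start, end + 1))
            i += 3
        else:
            out.append(spec[i])
            i += 1
    return "".join(out)

# Both sides expand to the same number of characters, so they pair up one to one.
table = str.maketrans(expand("A-Za-z1-90"), expand("Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪"))
result = "Hello 2013".translate(table)
print(result)

# The same principle applies inside a regex character class:
# U+24C2 lies between U+24B6 and U+24CF, so it matches.
print(bool(re.fullmatch("[\u24b6-\u24cf]", "\u24c2")))
```

The mapping works because the circled letters and digits occupy contiguous code points, just as A-Z and a-z do in ASCII.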

Answered 2013-06-21T20:05:29.417