2

In a Ruby script, I'm using String#gsub to generate a string that is then used as a regex. This regex has to match a literal + character, so I'm using \+ to escape it.

This example code isolates my source of confusion. In this code, the regex I want to create is /a\+b/. However, when I use #gsub, the regex that is returned is /ab/.

string = 'a\+b'
expected = Regexp.new(string)
actual = Regexp.new('x'.gsub('x', string))

# expected returns /a\+b/
# actual returns /ab/

I couldn't find anything in the Ruby documentation about #gsub and + characters. Can anybody help me understand what is happening to produce this result?

For now, to make my code work, I'm matching against \x2B, the ASCII hex code for the + character. Is there a way to achieve this that isn't so obfuscated?

Thanks in advance!

4 Answers

3

Let’s ignore the Regexp.new here, as it’s not really relevant—only the gsub itself is.

Your \+ is being interpreted as a back-reference by gsub. From the docs:

If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form \\d, where d is a group number, or \\k<n>, where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash. However, within replacement the special match variables, such as $&, will not refer to the current match.

While it’s not very clear (since the docs only mention a “group number”), \+ is replaced with the value of the global variable $+*; from the Ruby Quickref:

$+: Depends on $~. The highest group matched by the last successful match.

We can prove this by capturing something:

'x'.gsub(/(x)/, 'a\+b')  #=> "axb"

This shows that the \+ is being replaced with the capture from the regex. Since you have no captures in your pattern (as it is a string), the back-reference is replaced with an empty string, and you get "ab" as the result of the gsub.

The double-quoted "a\+b" passes through gsub unchanged because there is no \+ in it: in a double-quoted literal, \+ collapses to a plain +, so the string is just a+b.

"a\+b".bytes  #=> [97, 43, 98]
'a\+b'.bytes  #=> [97, 92, 43, 98]

* Kind of. It’s semantically equivalent, but the match global variables themselves aren’t actually set until after the gsub finishes replacing; the back-references, of course, are set before replacement occurs.
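
To make the behavior above concrete, here is a small sketch (the variable names are only illustrative) showing how \+ expands with one group, with two groups, and with none, and how doubling the backslash in the replacement keeps it literal:

```ruby
# \+ in a replacement string expands to the last capture group that matched.
one_group  = 'x'.gsub(/(x)/, 'a\+b')       # group 1 captured "x"
two_groups = 'xy'.gsub(/(x)(y)/, 'a\+b')   # group 2 captured "y"
no_groups  = 'x'.gsub('x', 'a\+b')         # no groups, so \+ expands to ""

# Doubling the backslash makes gsub emit a literal backslash instead.
# In single quotes, '\\\\' is two backslash characters.
escaped = 'x'.gsub('x', 'a\\\\+b')

p one_group, two_groups, no_groups, escaped
```

The last form works, but counting backslashes is easy to get wrong, so it is probably not the most readable fix.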

Answered 2013-05-25T18:24:18.740
1

Inside a replacement string \+ is used to refer to the value of the last capturing group (so if the regex includes, for example, 3 capturing groups \+ is the same as \3). If you use the block form of gsub instead, these substitutions will not be performed:

string = 'a\+b'
actual = Regexp.new( 'x'.gsub('x') { string } )
# actual is now /a\+b/
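
A quick side-by-side sketch of the string form and the block form, plus the hash form of gsub, which as far as I know also inserts its values without back-reference substitution (the via_* names are just for illustration):

```ruby
string = 'a\+b'

# String replacement: gsub interprets \+ as a back-reference.
via_string = 'x'.gsub('x', string)

# Block replacement: the block's return value is inserted as-is.
via_block = 'x'.gsub('x') { string }

# Hash replacement: the value looked up for the match is inserted as-is.
via_hash = 'x'.gsub('x', { 'x' => string })

p Regexp.new(via_string), Regexp.new(via_block), Regexp.new(via_hash)
```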
Answered 2013-05-25T18:18:23.283
0

The union method of Regexp is often used to create a regular expression from a combination of strings (and/or Regexps). Since it escapes these strings it is useful here too:

re = Regexp.union("a+b") # => /a\+b/ 
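
As a rough illustration of why the escaping matters, this sketch compares the escaped pattern with an unescaped one; Regexp.escape is shown as the string-returning counterpart, and the variable names are only placeholders:

```ruby
literal = 'a+b'

# Regexp.union escapes the string, so + is matched literally.
escaped_re = Regexp.union(literal)

# Regexp.escape returns the escaped source as a String.
escaped_source = Regexp.escape(literal)

# Without escaping, + is a quantifier: "one or more a, then b".
raw_re = Regexp.new(literal)

p escaped_re.match?('a+b')   # literal plus sign matches
p escaped_re.match?('aab')   # no literal plus sign here
p raw_re.match?('aab')       # quantifier form matches repeated a
p raw_re.match?('a+b')       # quantifier form does not see a literal +
```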
Answered 2013-05-25T18:18:59.977
-1

Regexp.new will automatically handle +.

Try this:

string = 'a+b'
expected = Regexp.new(string)
actual = Regexp.new('x'.gsub('x', string))

Let me know if you meant something else.

Another interpretation of your question led me to this:

string = 'a\\\+b'
expected = Regexp.new(string)
actual = Regexp.new('x'.gsub('x', string))
Answered 2013-05-25T17:56:22.817