I have an R "plugin" that reads a bunch of lines from stdin, parses them, and evaluates them.
...
code <- readLines(f, warn=FALSE) ## that's where the lines come from...
result <- eval(parse(text=code))
...
Now, sometimes the system that provides the lines of code kindly inserts a UTF-8 non-break space (U+00A0 = \xc2\xa0) here and there in the code. The parse() function chokes on such characters. Example:
s <- "1 +\xc2\xa03"
s
[1] "1 + 3" ## looks fine doesn't it? In fact, the Unicode "NON-BREAK SPACE" is there
eval(parse(text=s))
Error in parse(text = s) : <text>:1:4: unexpected input
1: 1 +?
^
eval(parse(text=gsub("\xc2\xa0"," ",s)))
[1] 4
I would like to replace that character with a regular space, and can do so (but at my own peril, I guess) as above with this:
code <- gsub('\xc2\xa0',' ',code)
However, this is not clean, as the byte sequence '\xc2\xa0' could conceivably start matching in the middle of another 2-byte char whose 2nd byte is 0xc2.
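One way to probe that worry is to look at the raw bytes directly. The sketch below is only illustrative: the names `s`, `bytes`, and `is_continuation` are made up for this example. It relies on the UTF-8 rule that every continuation (non-leading) byte has the bit pattern 10xxxxxx, i.e. lies in 0x80..0xBF, and 0xC2 is outside that range.

```r
# Same test string as above: "1 +", a no-break space (bytes c2 a0), then "3".
s <- "1 +\xc2\xa03"

# Raw bytes of the string, as integers.
bytes <- as.integer(charToRaw(s))

# Continuation bytes in UTF-8 always look like 10xxxxxx (0x80..0xBF).
is_continuation <- bitwAnd(bytes, 0xC0L) == 0x80L

# TRUE here would mean 0xC2 shows up as a trailing byte somewhere.
any(bytes[is_continuation] == 0xC2L)
```

If that reasoning holds, 0xC2 can only ever begin a character in valid UTF-8 input, so the byte pattern should not match starting in the middle of another character. Input that is not valid UTF-8 is a different story.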
Perhaps a bit better, we can say:
code <- gsub(intToUtf8(0x00a0L),' ',code)
But this does not obviously generalize to an arbitrary UTF-8 string.
Surely there is a better, more expressive way to enter a string containing some UTF-8 characters? In general, what's the right way to express a UTF-8 string (here, the pattern argument of gsub())?
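One candidate, sketched here under the assumption that R's `\u` string escapes (described in `?Quotes`) are acceptable for this purpose, is to write the pattern by code point rather than by bytes. The input vector `code` and the names `cleaned` and `result` below are hypothetical stand-ins for what readLines() would return.

```r
# Hypothetical input: the first line contains a no-break space after "+".
code <- c("1 +\u00a03", "x <- 2")

# Replace U+00A0 with a plain space; fixed = TRUE treats the pattern
# as a literal string rather than a regular expression.
cleaned <- gsub("\u00a0", " ", code, fixed = TRUE)

# The first line should now parse and evaluate like "1 + 3".
result <- eval(parse(text = cleaned[1]))
result
```

The escape keeps the source file pure ASCII, so the no-break space stays visible in the code instead of hiding as something that looks like a regular space.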
Edit: to be clear, I am interested in entering UTF-8 chars in a string by specifying their hexadecimal value. Consider the following example (note that "é" is Unicode U+00E9 and can be expressed in UTF-8 as 0xc3a9):
s <- "Cet été."
gsub("té","__",s)
# --> "Cet é__."
# works, but I like to keep my code itself free of UTF-8 literals,
# plus, for the initial question, I really don't want to enter an actual
# UTF-8 "NON BREAKABLE SPACE" in my code as it would be indistinguishable
# from a regular space.
gsub("t\xc3\xa9","__",s) ## works, but I question how standard and portable this is
# --> "Cet é__."
gsub("t\\xc3\\xa9","__",s) ## doesn't work
# --> "Cet été."
gsub("t\x{c3a9}","__",s) ## would work in Perl, doesn't seem to work in R
# Error: '\x' used without hex digits in character string starting "s\x"
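For comparison, here is a minimal sketch of the nearest R equivalents to the Perl-style `\x{...}` form, assuming the `\u` escapes documented in `?Quotes`. Note that they take the Unicode code point (U+00E9), not the UTF-8 byte value (0xc3a9). The names `r1`, `r2`, `r3`, and `pat` are made up for this example.

```r
# Same test string as above, written without any non-ASCII literal.
s <- "Cet \u00e9t\u00e9."

# Four-digit form of the escape.
r1 <- gsub("t\u00e9", "__", s)

# Braced form, which looks closest to Perl's \x{...}.
r2 <- gsub("t\u{e9}", "__", s)

# Building the pattern from integer code points: "t" is 0x74, "é" is 0xe9.
pat <- intToUtf8(c(0x74L, 0xe9L))
r3 <- gsub(pat, "__", s, fixed = TRUE)

r1  # intended to match the "Cet é__." result shown above
```

All three variants keep the source ASCII-only. Since intToUtf8() accepts a whole vector of code points, it can build multi-character strings too, not just a single character.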