
I have an R "plugin" that reads a bunch of lines from stdin, parses them and evaluates them.

...
code <- readLines(f, warn=FALSE)   ## that's where the lines come from...
result <- eval(parse(text=code))
...

Now, sometimes the system that provides the lines of code kindly inserts a UTF-8 no-break space (U+00A0 = \xc2\xa0) here and there in the code. parse() chokes on such characters. Example:

s <- "1 +\xc2\xa03"
s
[1] "1 + 3"   ## looks fine, doesn't it? In fact, the Unicode "NO-BREAK SPACE" is there

eval(parse(text=s))
Error in parse(text = s) : <text>:1:4: unexpected input
1: 1 +?
      ^

eval(parse(text=gsub("\xc2\xa0"," ",s)))
[1] 4

I would like to replace that character with a regular space, and can do so (but at my own peril, I guess) as above with this:

code <- gsub('\xc2\xa0',' ',code)

However, this is not clean, as the byte sequence '\xc2\xa0' could conceivably start matching in the middle of another 2-byte char whose 2nd byte is 0xc2.

Perhaps a bit better, we can say:

code <- gsub(intToUtf8(0x00a0L),' ',code)

But this does not generalize conveniently to a longer UTF-8 string.

Surely there is a better, more expressive way to enter a string containing some UTF-8 characters? In general, what's the right way to express a UTF-8 string (here, the pattern argument of gsub())?


Edit: to be clear, I am interested in entering UTF-8 chars in a string by specifying their hexadecimal value. Consider the following example (note that "é" is Unicode U+00E9 and is encoded in UTF-8 as the two bytes 0xc3 0xa9):

s <- "Cet été."
gsub("té","__",s)
# --> "Cet é__."
# works, but I like to keep my code itself free of UTF-8 literals;
# plus, for the initial question, I really don't want to enter an actual
# UTF-8 "NO-BREAK SPACE" in my code, as it would be indistinguishable
# from a regular space.

gsub("t\xc3\xa9","__",s)  ## works, but I question how standard and portable it is
# --> "Cet é__."

gsub("t\\xc3\\xa9","__",s)  ## doesn't work
# --> "Cet été."

gsub("t\x{c3a9}","__",s)  ## would work in Perl, doesn't seem to work in R
# Error: '\x' used without hex digits in character string starting "s\x"

1 Answer


(Earlier content was removed.)

Edit 2:

> s <- '\U00A0'
> s
[1] " "
> code <- gsub(s, '__','\xc2\xa0' )
> code
[1] "__"
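
A minimal sketch of how the same escape might be applied to the original problem. It uses "\u00a0", the four-hex-digit form of the escape shown above, and assumes a session that handles UTF-8 strings normally. The helper name replace_nbsp is hypothetical and only for illustration; it is not part of the plugin in the question.

```r
# NO-BREAK SPACE written as a Unicode escape, so no invisible
# character has to appear in the source file itself.
nbsp <- "\u00a0"

# Sanity check: the escape yields the same two UTF-8 bytes as "\xc2\xa0".
stopifnot(identical(charToRaw(enc2utf8(nbsp)), as.raw(c(0xc2, 0xa0))))

# Hypothetical helper: replace every NBSP with a regular space.
# fixed = TRUE treats the pattern as a literal string, not a regex.
replace_nbsp <- function(code) gsub(nbsp, " ", code, fixed = TRUE)

# The failing input from the question, rebuilt with the escape.
s <- paste0("1 +", nbsp, "3")
result <- eval(parse(text = replace_nbsp(s)))
result
# [1] 4

# The same style of escape covers the "été" example from the question.
cleaned <- gsub("t\u00e9", "__", "Cet \u00e9t\u00e9.")
cleaned
# [1] "Cet é__."
```

Writing the pattern as a code point rather than as raw bytes leaves the byte-level encoding to R, which is probably closer to the "more expressive" notation the question asks for.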
answered 2013-02-06T02:39:34.917