I'm trying to remove every Unicode character in a string if it falls in any the ranges below.
\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF
As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace
function.
var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
In this case, the character seems to have been replaced fine.
However, when I replace that with
var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
I see something unexpected. My output shows up as:
he�llo worl᷿fd is replaced with
There are two things to note here:
\u1dfff
does not show up as one character -\u1dff
gets converted to a character and thef
at the end it treated as its own character- the result is an empty string.
Any suggestions on how I can accomplish this would be much appreciated.
EDIT
My overall goal is to filter out all characters that the encodeURIComponent
function considers invalid. I ran some tests and found the list above to be the set of characters that a invalid. For instance, the code below, which first converts 1dfff
to a unicode character before passing that to encodeURIComponent
causes an exception to be raised by the latter function.
var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
I edited parts of the question after @Blender pointed out that i was using x
instead of u
in my code to represent Unicode characters.
EDIT 2
I investigated my technique for fetching the "invalid" unicode ranges further, and as it turns out, if you give String.fromCharacterCode
a number that's larger than 16 bits, it'll just look at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.