javascript - Unicode and URI encoding, decoding and escaping in JavaScript

Question

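If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me.

For example, for "%96", which should be a "–", I get an error when trying to decode: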

decodeURIComponent("%96");
URIError: URI malformed

If I try to encode "–", I actually get:

encodeURIComponent("–");
"%E2%80%93"

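I searched the internet and found this page, which mentions using escape and unescape together with decodeURIComponent and encodeURIComponent, respectively. That doesn't seem to help, because %96 never comes out as "–" no matter what I try, and this of course doesn't work: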

decodeURIComponent(escape("%96"));
"%96"

Not very helpful.

How can I get "%96" to become "–" using JavaScript (without hardcoding a map for every possible Unicode character I might run into)?

score 4 · Accepted Answer

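The sequence %XX in a URI encodes an "octet", that is, one eight-bit byte. This raises the question of which Unicode character the decoded bytes refer to. If I remember correctly, older versions of the URI specification did not define the assumed character set well. Later versions of the URI specification recommend UTF-8 as the default character set for encoding. That is, to decode a byte sequence, you decode each %XX sequence and then convert the resulting bytes to a string using the UTF-8 character set.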
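The two decoding steps described above can be sketched roughly as follows. The helper names `percentDecodeToBytes` and `decodeUtf8Strict` are made up for this illustration, and the sketch assumes that any character not written as %XX is plain ASCII.

```javascript
// Step 1: turn every %XX sequence into one byte (octet).
// Assumption: characters that are not part of a %XX sequence are ASCII.
function percentDecodeToBytes(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const hex = text.slice(i + 1, i + 3);
    if (text[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(text.charCodeAt(i));
    }
  }
  return Uint8Array.from(bytes);
}

// Step 2: interpret the bytes as UTF-8. With fatal: true the decoder
// throws a TypeError on malformed input instead of substituting U+FFFD.
function decodeUtf8Strict(bytes) {
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

const dash = decodeUtf8Strict(percentDecodeToBytes('%E2%80%93'));
console.log(dash === '\u2013'); // true: the three bytes form an en-dash
```

Passing the single byte from "%96" through `decodeUtf8Strict` throws, which mirrors the URIError that decodeURIComponent reports.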

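This explains why %96 doesn't decode. The hex value 0x96 is not a valid UTF-8 sequence. Since it lies outside ASCII, it is only valid when a special lead byte comes before it to indicate an extended character. (See the UTF-8 specification for more details.) JavaScript's encodeURIComponent() and decodeURIComponent() methods both assume UTF-8 (as they should), so I would not expect %96 to decode correctly.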
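To make the byte-level argument concrete, here is a small sketch that classifies a byte by its UTF-8 bit pattern. `classifyUtf8Byte` is a hypothetical helper, and it is deliberately simplified: it only inspects the leading bits and does not reject overlong lead bytes such as 0xC0.

```javascript
// Hypothetical helper: classify one byte by its leading UTF-8 bits.
function classifyUtf8Byte(byte) {
  if ((byte & 0x80) === 0x00) return 'ascii';        // 0xxxxxxx
  if ((byte & 0xC0) === 0x80) return 'continuation'; // 10xxxxxx
  if ((byte & 0xE0) === 0xC0) return 'lead-2';       // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 'lead-3';       // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 'lead-4';       // 11110xxx
  return 'invalid';
}

// 0x96 is 10010110 in binary: a continuation byte with no lead byte.
console.log(classifyUtf8Byte(0x96)); // continuation

// The correct encoding of U+2013 is one lead byte plus two continuations.
console.log([0xE2, 0x80, 0x93].map(classifyUtf8Byte).join(', '));
// lead-3, continuation, continuation
```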

The character you referenced is U+2013, an en-dash. How on earth does the page you reference get an en-dash from hex 0x96 (decimal 150)? They are obviously not assuming UTF-8 encoding, which is the standard. They are not assuming ASCII, which doesn't contain this character. They are not even assuming ISO-8859-1, which is a standard encoding that uses one byte per character. It turns out they are assuming the special Windows-1252 code page. That is, the URI you are trying to decode assumes that the user is on a Windows machine, and even worse, on a Windows machine in English (or one of a few other Western languages).

In short, the table you're using is bad. It's out-of-date and assumes that the user is on an English Windows system. The up-to-date and correct way to encode non-ASCII values is to convert them to UTF-8 and then encode each octet using %XX. That's why you got %E2%80%93 when you tried to encode the character, and that's what decodeURIComponent() is expecting. The URI you're using is not encoded correctly. If you have no other choice, you can guess that the URI is using Windows-1252, convert the bytes yourself, and then use a Windows-1252 table to find out what Unicode values were intended. But that's risky: how do you know which URI uses which table? That's why everybody settled on UTF-8. If possible, tell whoever is giving you these URIs to encode them correctly.
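If guessing Windows-1252 really is the only option, the fallback could look something like the sketch below. `decodeCp1252Component` and `decodeLenient` are hypothetical names, and the "try UTF-8 first, then fall back" rule is only a heuristic with exactly the risk described above. Only bytes 0x80 to 0x9F need a lookup table; every other byte maps to the code point with the same value, as in ISO-8859-1. Bytes that Windows-1252 leaves undefined are mapped to the matching control code here.

```javascript
// Code points for Windows-1252 bytes 0x80..0x9F (32 entries).
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Hypothetical helper: treat every %XX as one Windows-1252 byte.
function decodeCp1252Component(text) {
  return text.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const byte = parseInt(hex, 16);
    const codePoint =
      byte >= 0x80 && byte <= 0x9F ? CP1252_HIGH[byte - 0x80] : byte;
    return String.fromCharCode(codePoint);
  });
}

// Hypothetical helper: prefer UTF-8, fall back to the Windows-1252 guess
// only when decodeURIComponent rejects the input.
function decodeLenient(text) {
  try {
    return decodeURIComponent(text);
  } catch (e) {
    if (e instanceof URIError) return decodeCp1252Component(text);
    throw e;
  }
}

console.log(decodeLenient('%96') === '\u2013');       // true (fallback path)
console.log(decodeLenient('%E2%80%93') === '\u2013'); // true (UTF-8 path)
```

Note that a Windows-1252 byte sequence that happens to be valid UTF-8 will be decoded as UTF-8, so this cannot replace fixing the encoding at the source.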

score 3 · Accepted Answer

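Posted as a community wiki entry because it comes from Cal Henderson's "Building Scalable Web Sites". The book says that reproducing significant portions of the examples is OK, though. You could use it to create a special case for "–".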

function escape_utf8(data) {
    if (data == '' || data == null) {
        return '';
    }
    data = data.toString();
    var buffer = '';
    for (var i = 0; i < data.length; i++) {
        // codePointAt reads a whole code point, so surrogate pairs reach
        // the 4-byte branch (charCodeAt never returns more than 0xFFFF)
        var c = data.codePointAt(i);
        var bs = [];
        if (c >= 0x10000) {
            // 4 bytes
            bs[0] = 0xF0 | ((c & 0x1C0000) >>> 18);
            bs[1] = 0x80 | ((c & 0x3F000) >>> 12);
            bs[2] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[3] = 0x80 | (c & 0x3F);
            i++; // skip the low surrogate that was just consumed
        } else if (c >= 0x800) {
            // 3 bytes
            bs[0] = 0xE0 | ((c & 0xF000) >>> 12);
            bs[1] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[2] = 0x80 | (c & 0x3F);
        } else if (c >= 0x80) {
            // 2 bytes
            bs[0] = 0xC0 | ((c & 0x7C0) >>> 6);
            bs[1] = 0x80 | (c & 0x3F);
        } else {
            // 1 byte
            bs[0] = c;
        }
        // percent-encode every byte, including plain ASCII
        for (var j = 0; j < bs.length; j++) {
            var b = bs[j];
            var hex = nibble_to_hex((b & 0xF0) >>> 4) + nibble_to_hex(b & 0x0F);
            buffer += '%' + hex;
        }
    }
    return buffer;
}

function nibble_to_hex(nibble) {
    var chars = '0123456789ABCDEF';
    return chars.charAt(nibble);
}
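For comparison, the same "percent-encode every UTF-8 byte" idea can be sketched with the built-in TextEncoder, which does the UTF-8 conversion itself. This is not the book's code; `percentEncodeAllBytes` is a hypothetical name, and unlike encodeURIComponent it escapes ASCII letters too.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeAllBytes(text) {
  if (text === null || text === undefined) return '';
  // TextEncoder always produces UTF-8.
  const bytes = new TextEncoder().encode(String(text));
  let out = '';
  for (const b of bytes) {
    out += '%' + b.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

console.log(percentEncodeAllBytes('\u2013')); // %E2%80%93
console.log(percentEncodeAllBytes('a'));      // %61
```

The output for the en-dash matches what encodeURIComponent returns, so decodeURIComponent can read it back.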

score 1 · Accepted Answer

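See this question, and in particular this answer:

There is a special "%uNNNN" format for encoding Unicode UTF-16 code points, instead of encoding UTF-8 bytes

I suspect "–" is one of those characters, since 0x96 in the extended ASCII table (code page 437) is û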
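As a rough illustration of the "%uNNNN" format, here is a sketch that decodes it without calling the deprecated unescape(). `decodePercentU` is a hypothetical helper; it only handles the four-digit %uNNNN form and leaves ordinary %XX sequences untouched.

```javascript
// Hypothetical helper: decode %uNNNN sequences, each of which names one
// UTF-16 code unit. Ordinary %XX sequences are left as they are.
function decodePercentU(text) {
  return text.replace(/%u([0-9A-Fa-f]{4})/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// The legacy escape() function produces this form for code units above
// 0xFF: escape('\u2013') returns '%u2013'.
console.log(decodePercentU('%u2013') === '\u2013'); // true
console.log(decodePercentU('%96'));                 // %96 (unchanged)
```

The second call shows that this format does not help with "%96" itself, because that string has no "u" prefix and carries only a single byte.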
