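The sequence %XX in a URI encodes one octet, that is, one eight-bit byte. That raises the question of which Unicode character the decoded bytes refer to. If I remember correctly, older versions of the URI specification did not clearly define the assumed character set. Later versions of the URI specification recommend UTF-8 as the default character encoding. That is, to decode a byte sequence, you decode each %XX sequence and then convert the resulting bytes to a string using the UTF-8 character set.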
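The two steps described above can be sketched roughly as follows. This is only an illustration, not how any particular engine implements it: the helper names percentDecodeToBytes and decodeUtf8Uri are made up for this example, and the sketch assumes the input contains only ASCII characters outside its %XX escapes.

```javascript
// Step 1: turn every %XX escape into the byte it stands for.
// Assumption: unescaped characters are plain ASCII, one byte each.
function percentDecodeToBytes(input) {
  const bytes = [];
  for (let i = 0; i < input.length; i++) {
    const hex = input.slice(i + 1, i + 3);
    if (input[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(input.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Step 2: interpret the resulting bytes as UTF-8.
function decodeUtf8Uri(input) {
  return new TextDecoder('utf-8').decode(percentDecodeToBytes(input));
}

console.log(decodeUtf8Uri('a%E2%80%93b')); // prints "a–b" (U+2013 between the letters)
```

In practice decodeURIComponent() does both steps for you; the sketch only separates them to make the byte-level view visible.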
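This explains why %96 will not decode. The hex value 0x96 on its own is not a valid UTF-8 sequence. Since it lies outside ASCII, it would need a lead byte in front of it to indicate a multi-byte character. (See the UTF-8 specification for more details.) The JavaScript encodeURIComponent() and decodeURIComponent() functions both assume UTF-8 (as they should), so I would not expect %96 to decode correctly.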
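A small check along these lines shows the behavior. The helpers isContinuationByte and tryDecode are hypothetical names introduced just for this sketch; only decodeURIComponent() and encodeURIComponent() are built in.

```javascript
// 0x96 is 10010110 in binary. The 10xxxxxx bit pattern marks a UTF-8
// continuation byte, which is only valid after a lead byte.
function isContinuationByte(byte) {
  return (byte & 0xc0) === 0x80;
}

// Wrap decodeURIComponent so a failure is reported instead of thrown.
function tryDecode(component) {
  try {
    return { ok: true, value: decodeURIComponent(component) };
  } catch (err) {
    // decodeURIComponent throws URIError on malformed UTF-8
    return { ok: false, error: err.name };
  }
}

console.log(isContinuationByte(0x96));   // true
console.log(tryDecode('%96'));           // ok: false, error: 'URIError'
console.log(tryDecode('%E2%80%93'));     // ok: true, value: '–'
console.log(encodeURIComponent('\u2013')); // %E2%80%93
```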
The character you referenced is U+2013, an en-dash. How on earth does the page you reference get an en-dash from hex 0x96 (decimal 150)? They are obviously not assuming UTF-8 encoding, which is the standard. They are not assuming ASCII, which doesn't contain this character. They are not even assuming ISO-8859-1, a standard one-byte-per-character encoding in which 0x96 is not a printable character. It turns out they are assuming the special Windows-1252 code page. That is, the URI you are trying to decode assumes that the user is on a Windows machine, and even worse, on a Windows machine in English (or one of a few other Western languages).
In short, the table you're using is bad. It's out-of-date and assumes that the user is on an English Windows system. The up-to-date and correct way to encode non-ASCII values is to convert them to UTF-8 and then encode each octet using %XX. That's why you got %E2%80%93 when you tried to encode the character, and that's what decodeURIComponent() is expecting. The URI you're using is not encoded correctly. If you have no other choice, you can guess that the URI is using Windows-1252, convert the bytes yourself, and then use a Windows-1252 table to find out what Unicode values were intended. But that's risky: how do you know which URI uses which table? That's why everybody settled on UTF-8. If possible, tell whoever is giving you these URIs to encode them correctly.
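If you really are stuck with such URIs, the last-resort guess could look something like this. It is a sketch under the assumption that anything failing UTF-8 decoding is Windows-1252; the names CP1252_HIGH, decodeCp1252Uri and decodeWithFallback are invented for illustration.

```javascript
// Windows-1252 differs from ISO-8859-1 only in the byte range 0x80-0x9F.
// Code points for those 32 bytes, in order. The five bytes that
// Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
// mapped to the control code of the same value.
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Replace each %XX escape with the character Windows-1252 assigns to that byte.
function decodeCp1252Uri(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) => {
    const byte = parseInt(hex, 16);
    const codePoint =
      byte >= 0x80 && byte <= 0x9f ? CP1252_HIGH[byte - 0x80] : byte;
    return String.fromCharCode(codePoint);
  });
}

// Try the correct interpretation (UTF-8) first; guess Windows-1252 otherwise.
function decodeWithFallback(input) {
  try {
    return decodeURIComponent(input);
  } catch (err) {
    if (!(err instanceof URIError)) throw err;
    return decodeCp1252Uri(input); // a guess, not a guarantee
  }
}

console.log(decodeWithFallback('1990%961995'));       // 1990–1995
console.log(decodeWithFallback('1990%E2%80%931995')); // 1990–1995
```

Note that the fallback only fires when the bytes are not valid UTF-8. A Windows-1252 string whose bytes happen to form valid UTF-8 would be decoded as UTF-8 without any error, which is exactly the kind of ambiguity that makes this approach risky.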