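Java originally supported Unicode 1.0 by making the char type 16 bits wide, but Unicode 2.0 introduced the surrogate mechanism to support more characters than 16 bits allow. Java strings therefore became UTF-16 encoded, which means that some characters need two Java chars to be represented; these are called the high surrogate and the low surrogate.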
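To find out which chars in a string are actually part of a high/low surrogate pair, you can use the following utility methods of the Character class: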
Character.isHighSurrogate(myChar); // returns true if myChar is a high surrogate
Character.isLowSurrogate(myChar); // same for low surrogate
Character.isSurrogate(myChar); // just to know if myChar is a surrogate
Once you know which chars are high or low surrogates, you need to convert each pair into a Unicode code point with the following method:
int codePoint = Character.toCodePoint(highSurrogate, lowSurrogate);
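As a minimal sketch of the two steps above, the snippet below checks a pair and combines it into a code point. The class name SurrogateDemo and the helper combine are hypothetical names used only for illustration; U+1F600 is simply a convenient character outside the 16-bit range, written with escapes so the source file encoding does not matter.

```java
public class SurrogateDemo {
    // Hypothetical helper: returns the code point encoded by a high/low surrogate pair.
    public static int combine(char high, char low) {
        // Character.isSurrogatePair checks that high is a high surrogate and low is a low surrogate
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as the two chars D83D and DE00
        String s = "\uD83D\uDE00";
        char high = s.charAt(0);
        char low = s.charAt(1);
        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true
        System.out.printf("U+%X%n", combine(high, low));     // U+1F600
    }
}
```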
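Since a piece of code is worth a thousand words, here is an example method that replaces the non-US-ASCII characters in a string (and the XML special characters) with XML character references: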
public static String replaceToCharEntities(String str) {
    StringBuilder result = new StringBuilder(str.length());
    char surrogate = 0;
    for (char c : str.toCharArray()) {
        // if char is a high surrogate, keep it to match it
        // against the next char (low surrogate)
        if (Character.isHighSurrogate(c)) {
            surrogate = c;
            continue;
        }
        // get codePoint; only combine when c really is a low surrogate,
        // so a malformed pair does not produce a bogus code point
        int codePoint;
        if (surrogate != 0 && Character.isLowSurrogate(c)) {
            codePoint = Character.toCodePoint(surrogate, c);
        } else {
            codePoint = c;
        }
        surrogate = 0;
        // decide whether to use just a char or a character reference
        if (codePoint < 0x20 || codePoint > 0x7E || codePoint == '<'
                || codePoint == '>' || codePoint == '&' || codePoint == '"'
                || codePoint == '\'') {
            result.append(String.format("&#x%x;", codePoint));
        } else {
            result.append(c);
        }
    }
    return result.toString();
}
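The following string is a good test case, because it contains a non-ASCII character that fits in a single 16-bit value as well as a character that needs a high/low surrogate pair: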
String myString = "text with some non-US chars: 'Ñ' and ''";
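For comparison, here is a sketch of the same replacement written against code points directly, assuming Java 8 or later, where String.codePoints() is available. It is not the method shown above: the class name CharEntities and the method name replaceByCodePoint are hypothetical, and the test character U+1F600 is my own choice, written with escapes. One behavioural difference worth noting: codePoints() passes an unpaired surrogate through as its own value, so it is emitted as a reference rather than dropped.

```java
public class CharEntities {
    // Hypothetical variant: walks the string by code point, so surrogate
    // pairs are already combined by the time each value is inspected.
    public static String replaceByCodePoint(String str) {
        StringBuilder result = new StringBuilder(str.length());
        str.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E || cp == '<' || cp == '>'
                    || cp == '&' || cp == '"' || cp == '\'') {
                // emit a hexadecimal XML character reference
                result.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                // printable US-ASCII: safe to cast back to a single char
                result.append((char) cp);
            }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        // U+00D1 fits in one char; U+1F600 needs the pair D83D DE00
        String s = "\u00D1 and \uD83D\uDE00";
        System.out.println(replaceByCodePoint(s)); // &#xd1; and &#x1f600;
    }
}
```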