java - What exactly does String.codePointAt do?

Question

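I recently came across the codePointAt method of String in Java. I also found a few other codePoint methods: codePointBefore, codePointCount, and so on. They clearly have something to do with Unicode, but I don't understand them.

Now I'm wondering when and how one should use codePointAt and similar methods.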

score 52 · Accepted Answer

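Short answer: it gives you the Unicode code point that starts at the specified index in the String, i.e. the "Unicode number" of the character at that position.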
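
To make the "Unicode number" idea concrete, here is a minimal sketch. The class name `CodePointAtDemo` and the helper `describe` are made up for illustration; only `String.codePointAt` and `String.format` come from the JDK.

```java
class CodePointAtDemo {
    // Formats the code point that starts at the given char index as "U+XXXX".
    // (Hypothetical helper, not part of the JDK.)
    static String describe(String s, int index) {
        return String.format("U+%04X", s.codePointAt(index));
    }

    public static void main(String[] args) {
        System.out.println(describe("Java", 0));         // U+004A ('J')
        System.out.println(describe("\u00E9", 0));       // U+00E9 (e with acute accent)
        System.out.println(describe("\uD83D\uDE00", 0)); // U+1F600 (stored as two chars)
    }
}
```

Note that codePointAt returns an int, not a char, precisely because the result may not fit into 16 bits.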

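Longer answer: Java was created when 16 bits (a.k.a. a char) were enough to hold any Unicode character that existed (that range is now known as the Basic Multilingual Plane, or BMP). Later, Unicode was extended to include characters with code points above U+FFFF (that is, 2¹⁶ and beyond). This means that a char can no longer hold every possible Unicode code point.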

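UTF-16 was the solution: it stores the "old" Unicode code points in 16 bits (i.e. exactly one char) and all the new ones in 32 bits (i.e. two char values). Those two 16-bit values are called a "surrogate pair". Strictly speaking, a char now holds a "UTF-16 code unit" rather than "a Unicode character" as it used to.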
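
As a rough sketch of how a surrogate pair is built, the code below splits a supplementary code point by hand and compares the result with `Character.toChars`. The class and method names are invented for this example; the arithmetic follows the standard UTF-16 encoding rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```java
import java.util.Arrays;

class SurrogatePairDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // high and low surrogates. Assumes the caller passes a valid
    // supplementary code point; no range checking is done here.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x1F600);
        char[] library = Character.toChars(0x1F600);
        System.out.println(Arrays.equals(manual, library));                 // true
        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]); // D83D DE00
    }
}
```

In real code you would simply call Character.toChars; the manual version is only there to show what the two char values contain.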

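Now, as long as you don't use any of the "new" Unicode characters (or don't really care about them), all the "old" methods (the ones that handle only char) work fine. But if you do care about the new characters (or simply need full Unicode support), you'll need to use the "code point" versions, which actually support every possible Unicode code point.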
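
One common place where the difference shows up is iteration. The sketch below walks a string code point by code point instead of char by char; `CodePointIteration` and `codePointsOf` are names made up for this example, while `codePointAt` and `Character.charCount` are standard JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIteration {
    // Collects every code point of the string, advancing by one or two
    // chars depending on whether the code point needs a surrogate pair.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (as a surrogate pair), 'b'
        for (int cp : codePointsOf("a\uD83D\uDE00b")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

On Java 8 and later, s.codePoints() gives the same sequence as an IntStream, which is usually the shorter option.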

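Note: a very well-known example of Unicode characters that are not in the BMP (i.e. that only work when you use the code point variants) is emoji: even the simple Grinning Face, U+1F600, can't be represented in a single char.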

score 6 · Accepted Answer

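Code points support characters above 65535, which is Character.MAX_VALUE.

If you have text with such high characters, you have to work with code points (int values) instead of chars.

It does this by supporting UTF-16, which can use one or two 16-bit chars and turn them into an int.

AFAIK, this is generally only needed for the characters in the Supplementary Multilingual and Supplementary Ideographic planes that were added more recently, such as non-traditional Chinese.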
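
Roughly, turning one or two chars into an int looks like the sketch below. It is a hand-written approximation of what codePointAt does, built from `Character.isHighSurrogate`, `Character.isLowSurrogate` and `Character.toCodePoint`; the class and method names are hypothetical.

```java
class ToCodePointDemo {
    // Decodes the code point that starts at char index i.
    // A valid high/low surrogate pair becomes one supplementary code point;
    // anything else (BMP char or unpaired surrogate) is returned as-is.
    static int decodeAt(String s, int i) {
        char first = s.charAt(i);
        if (Character.isHighSurrogate(first) && i + 1 < s.length()) {
            char second = s.charAt(i + 1);
            if (Character.isLowSurrogate(second)) {
                return Character.toCodePoint(first, second);
            }
        }
        return first;
    }

    public static void main(String[] args) {
        String s = "x\uD83D\uDE00";
        for (int i = 0; i < s.length(); i++) {
            // Both columns should agree at every index.
            System.out.printf("%d: %X %X%n", i, decodeAt(s, i), s.codePointAt(i));
        }
    }
}
```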

score 1 · Accepted Answer

The following code example helps clarify codePointAt:

    // The supplementary character between "1" and "3" was lost from the original post;
    // U+1F600 (written as a surrogate pair) stands in for it here.
    String myStr = "1\uD83D\uDE003";
    System.out.println(myStr.length()); // prints 4, because the emoji takes two chars
    System.out.println(myStr.codePointCount(0, myStr.length())); // prints 3, counting whole code points

    int result = myStr.codePointAt(0);
    System.out.println(Character.toChars(result)); // prints 1

    result = myStr.codePointAt(1);
    System.out.println(Character.toChars(result)); // prints the emoji, because codePointAt reads the whole surrogate pair (high and low)

    result = myStr.codePointAt(2);
    System.out.println(Character.toChars(result)); // prints only the low surrogate, which typically shows as "?"

    result = myStr.codePointAt(3);
    System.out.println(Character.toChars(result)); // prints 3
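
Since the question also mentions codePointBefore, here is a small companion sketch, using U+1F600 as an example supplementary character. The helper names are invented for illustration; `codePointBefore` and `offsetByCodePoints` are standard String methods.

```java
class CodePointBeforeDemo {
    // Returns the last code point of a non-empty string, reading a full
    // surrogate pair if the string ends with one.
    static int lastCodePoint(String s) {
        return s.codePointBefore(s.length());
    }

    // Returns the char index at which the n-th code point (0-based) starts.
    static int charIndexOfCodePoint(String s, int n) {
        return s.offsetByCodePoints(0, n);
    }

    public static void main(String[] args) {
        String s = "1\uD83D\uDE003";
        System.out.printf("U+%04X%n", lastCodePoint("1\uD83D\uDE00")); // U+1F600
        System.out.println(charIndexOfCodePoint(s, 2));                // 3, the index of '3'
    }
}
```

offsetByCodePoints is handy when you have a position counted in code points and need the matching char index for methods such as substring.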

score 0 · Accepted Answer

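In short: rarely, as long as you're using the default charset in Java :) But for a more detailed explanation, try the following posts:

Comparing a char to a code-point? http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html http://javarevisited.blogspot.com/2012/01/java-string-codepoint-get-unicode.html

Hope this helps clarify things for you :)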
