javascript - How to render 32-bit Unicode characters in Google V8 (and Node.js)

Question

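Does anyone know how to render Unicode 'astral plane' characters (those whose CIDs are beyond 0xffff) in Google V8, the JavaScript VM that drives both Google Chrome and Node.js?

Funnily enough, when I give Google Chrome (it identifies as 11.0.696.71, running on Ubuntu 10.4) an HTML page like this: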

<script>document.write( "helo" )
document.write( " ⿸子" );
</script>

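it correctly renders the 'wide' character as well as the 'narrow' ones, but when I try the equivalent in Node.js (using console.log()) I get a single � (0xfffd, REPLACEMENT CHARACTER) in place of the 'wide' character.

I have also been told that, for whatever incomprehensible reason, Google decided to implement characters using a 16-bit-wide data type. While I find that stupid, the surrogate code points were designed precisely to let 'astral code points' be 'channeled' through 16-bit-challenged pathways. And somehow the V8 running inside Chrome 11.0.696.71 seems to use this bit of unicode-foo, or some other magic, to do its work (I seem to remember that years ago I always got boxes instead, even on static pages).

Ah yes, node --version reports v0.4.10; I still have to figure out how to get a V8 version number from that.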
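The surrogate mechanism mentioned above is plain arithmetic: subtract 0x10000 from the code point and split the remaining 20 bits into two 10-bit halves. A minimal sketch, assuming a hypothetical helper named `toSurrogatePair` (an illustration, not part of V8 or Node.js):

```javascript
// Hypothetical helper (not a V8 or Node.js API): split an astral code point
// into the two 16-bit surrogate code units that UTF-16 uses to represent it.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in 0x10000..0x10FFFF');
  }
  var offset = codePoint - 0x10000;    // 20 significant bits remain
  var high = 0xD800 + (offset >> 10);  // upper 10 bits -> high (lead) surrogate
  var low = 0xDC00 + (offset & 0x3FF); // lower 10 bits -> low (trail) surrogate
  return [high, low];
}

var pair = toSurrogatePair(0x1D49C);
var text = String.fromCharCode(pair[0], pair[1]);
console.log(pair[0].toString(16), pair[1].toString(16)); // d835 dc9c
console.log(text.length); // 2, because length counts 16-bit code units
```

Whether the resulting pair is displayed as one glyph depends on whatever renders the string, not on this arithmetic.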

UPDATE: I did the following in CoffeeScript:

a = String.fromCharCode( 0xd801 )
b = String.fromCharCode( 0xdc00 )
c = a + b
console.log a
console.log b
console.log c
console.log String.fromCharCode( 0xd835, 0xdc9c )

but that only gives me

���
���
������
������

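The thinking behind this is that, since the braindead part of the JavaScript specification that deals with Unicode appears to mandate? / not downright forbid? / allow? the use of surrogate pairs, maybe my source file encoding (UTF-8) is part of the problem. After all, there are two ways to encode 32-bit code points in UTF-8: one is to write out the UTF-8 octets needed for the first surrogate, followed by those needed for the second surrogate; the other (which is the preferred way, as per the UTF-8 specification) is to calculate the resulting code point and write out the octets needed for that code point. So here I take source file encoding out of the picture entirely by dealing only with numbers. The code above does work with document.write() in Chrome, so I know I got the numbers right.

Sigh.

EDIT: I did some experiments and found that when I do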
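To make the two byte sequences described above concrete, here is a minimal sketch using a hypothetical `encodeUtf8` helper (an illustration, not a library function). Encoding each surrogate on its own yields six octets, a form usually called CESU-8 and not regarded as well-formed UTF-8; encoding the combined code point yields the four octets that the UTF-8 specification calls for.

```javascript
// Hypothetical helper: encode one numeric value as UTF-8-style octets.
// It deliberately does NOT reject surrogates, so both variants can be shown.
function encodeUtf8(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Render a list of octets as space-separated hex for display.
function hex(octets) {
  return octets.map(function (o) { return o.toString(16); }).join(' ');
}

// Way 1: each surrogate encoded separately (six octets).
var perSurrogate = encodeUtf8(0xD801).concat(encodeUtf8(0xDC00));
// Way 2: the combined code point U+10400 encoded once (four octets).
var perCodePoint = encodeUtf8(0x10400);

console.log(hex(perSurrogate)); // ed a0 81 ed b0 80
console.log(hex(perCodePoint)); // f0 90 90 80
```

A strict UTF-8 decoder is entitled to reject the six-octet form, which is one reason working with numeric code units, as in the CoffeeScript test above, is a cleaner experiment.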

var f = function( text ) {
  document.write( '<h1>',  text,                                '</h1>'  );
  document.write( '<div>', text.length,                         '</div>' );
  document.write( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
  document.write( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
  console.log( '<h1>',  text,                                 '</h1>'  );
  console.log( '<div>', text.length,                          '</div>' );
  console.log( '<div>0x', text.charCodeAt(0).toString( 16 ),  '</div>' );
  console.log( '<div>0x', text.charCodeAt(1).toString( 16 ),  '</div>' ); };

f( '𩄎' );
f( String.fromCharCode( 0xd864, 0xdd0e ) );

I do get correct results in Google Chrome, both inside the browser window and on the console:


2
0xd864
0xdd0e

2
0xd864
0xdd0e

However, this is what I get when using Node.js' console.log:

<h1> � </h1>
<div> 1 </div>
<div>0x fffd </div>
<div>0x NaN </div>
<h1> �����</h1>
<div> 2 </div>
<div>0x d864 </div>
<div>0x dd0e </div>

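This seems to indicate that both parsing UTF-8 with CIDs beyond 0xffff and outputting those characters to the console are broken. Python 3.1, by the way, does treat the character as a surrogate pair and can print it to the console.

NOTE: I've cross-posted this question to the v8-users mailing list.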
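The two code units reported by Chrome can be cross-checked by recombining them by hand. A small sketch, assuming a hypothetical helper named `codePointFromPair`:

```javascript
// Hypothetical helper: recombine a high/low surrogate pair into the
// code point it stands for.
function codePointFromPair(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError('not a valid high/low surrogate pair');
  }
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var text = String.fromCharCode(0xd864, 0xdd0e);
var cp = codePointFromPair(text.charCodeAt(0), text.charCodeAt(1));
console.log('U+' + cp.toString(16).toUpperCase()); // U+2910E
```

So the string that Chrome reports as length 2 holds a single code point above 0xffff, which is consistent with the in-memory representation being fine and the trouble lying in how source text is read and how output is written.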

score 10 · Accepted Answer

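This recent presentation covers all sorts of issues with Unicode in popular languages, and it isn't kind to JavaScript: The Good, the Bad, & the (mostly) Ugly

He covers the problem with the two-byte representation of Unicode in JavaScript:

The UTF-16 née UCS-2 Curse

Like several other languages, JavaScript suffers from The UTF-16 Curse. Except that JavaScript has an even worse form of it, The UCS-2 Curse. Things like charCodeAt and fromCharCode only ever deal with 16-bit quantities, not with real, 21-bit Unicode code points. Therefore, if you want to print out something like 𝒜, U+1D49C, MATHEMATICAL SCRIPT CAPITAL A, you have to specify not one character but two "char units": "\uD835\uDC9C".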

// ERROR!! 
document.write(String.fromCharCode(0x1D49C));
// needed bogosity
document.write(String.fromCharCode(0xD835,0xDC9C));
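The reason the first call is an error is that `String.fromCharCode` truncates each argument to 16 bits, so 0x1D49C silently becomes 0xD49C. The sketch below demonstrates the truncation. As an assumption about the runtime, it also uses `String.fromCodePoint` and `codePointAt`, which were added in ES2015 and are newer than the engines discussed in this thread, so the call is guarded.

```javascript
// fromCharCode keeps only the low 16 bits of each argument.
var truncated = String.fromCharCode(0x1D49C);
console.log(truncated.length);                     // 1
console.log(truncated.charCodeAt(0).toString(16)); // d49c

// Spelling out the surrogate pair gives the intended character.
var viaPair = String.fromCharCode(0xD835, 0xDC9C);
console.log(viaPair.length); // 2

// ES2015 and later only: work with the code point directly.
if (typeof String.fromCodePoint === 'function') {
  var viaCodePoint = String.fromCodePoint(0x1D49C);
  console.log(viaCodePoint === viaPair);                 // true
  console.log(viaCodePoint.codePointAt(0).toString(16)); // 1d49c
}
```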

score 2 · Answer

I think it's a console.log issue. Since console.log is only for debugging, do you have the same issue when you output from Node via HTTP to a browser?
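One way to take console.log and the terminal out of the picture, in the spirit of the suggestion above, is to look at the bytes Node.js would actually send. A minimal sketch, assuming a current Node.js release where `Buffer.from` is available (this is not the v0.4.10 API from the question):

```javascript
// Build the string from numeric code units so source-file encoding is irrelevant.
var text = String.fromCharCode(0xd864, 0xdd0e); // U+2910E as a surrogate pair

// Encode to UTF-8 exactly as it would be written to a socket or file.
var bytes = Buffer.from(text, 'utf8');
console.log(bytes.length);          // 4
console.log(bytes.toString('hex')); // f0a9848e

// Decoding the bytes again should give back the same two code units.
var roundTrip = bytes.toString('utf8');
console.log(roundTrip === text);    // true
```

If this four-octet sequence is what reaches the browser, the string handling is sound and the problem is confined to console output.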
