Questions tagged "astral-plane"

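0 votes

2 answers

2573 views

c# - Strings and 4-byte Unicode characters

I have a question about strings and chars in C#. I found that a string in C# is a Unicode string and that a char takes 2 bytes, so every char is UTF-16 encoded. That's great, but I also read on Wikipedia that some characters take 4 bytes in UTF-16.

I'm writing a program that lets you draw characters for alphanumeric displays. It also has a tester where you can type a string, and it draws the string so you can see how it looks.

So how should I handle a string in which the user types a character that takes 4 bytes, i.e. 2 chars? I need to walk the string character by character, find each character in a list, and draw it on the panel.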
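The loop the question asks about can be sketched language-neutrally: walk the 16-bit code units and, whenever a high surrogate is followed by a low surrogate, combine the two into one code point. The Python below is only an illustration of that arithmetic, not the asker's code; the function name and the sample input are assumptions. In C#, the same checks would likely go through char.IsHighSurrogate and char.ConvertToUtf32.

```python
def code_points(units):
    """Yield Unicode code points from a sequence of UTF-16 code units (ints)."""
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        # A high surrogate followed by a low surrogate encodes one astral code point.
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit (or an unpaired surrogate, passed through unchanged).
            yield unit
            i += 1


# Hypothetical input: "A", then U+1F0A0 as a surrogate pair, then "B".
units = [0x0041, 0xD83C, 0xDCA0, 0x0042]
decoded = list(code_points(units))
print([hex(cp) for cp in decoded])
```

Each value produced this way is one lookup key for the glyph list, however many chars it occupied in the string.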

c# string unicode astral-plane

2012-12-23T11:53:13.477

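0 votes

1 answer

208 views

unicode - ANTLR-generated lexer hangs on Unicode characters from the supplementary planes (ANTLR 3.4)

I'm parsing PHP code using an ANTLR grammar and the ANTLR Ruby target. One of the source files I have to parse actually contains translations, some of which make heavy use of Unicode characters. The grammar seems to hang on a character from the supplementary planes, namely U+10430.

I've had similar problems in the past, because the Ruby ANTLR target is old and was not Unicode-aware (well, Ruby wasn't at the time). We had to raise getMaxCharValue in RubyTarget.java from 0xFF (ascii) to 0xFFFF (unicode) to fix it. Now it seems that even this isn't enough. Unicode states that characters outside this range can be represented using two UTF-16 characters, but how does ANTLR handle that? Would bumping getMaxCharValue again help (it did once, but I don't like the "just try it" approach)?

Thanks!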

unicode antlr astral-plane

2012-12-26T14:16:13.210

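0 votes

4 answers

8826 views

java - Java charAt for characters with two code units

From Core Java, Vol. 1, 9th ed., p. 69:

The character ℤ requires two code units in the UTF-16 encoding. Calling

does not return a space but the second code unit of ℤ.

But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true.

Why?

In case it's relevant, I'm using JDK SE 1.7.0_09 on Ubuntu 12.10.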
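One way to look into the observation is to count the code units directly. The sketch below assumes the character shown in the excerpt is U+2124, which lies in the BMP, and compares it with an arbitrary astral character, U+1D546. Both sentences and the helper name are hypothetical, since the book's code is not reproduced in this listing.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


bmp_sentence = "\u2124 is the set of integers"          # assumed: U+2124
astral_sentence = "\U0001D546 is the set of octonions"  # assumed: U+1D546

bmp_units = utf16_units(bmp_sentence)
astral_units = utf16_units(astral_sentence)

# Index 1 is the analogue of sentence.charAt(1) in Java.
print(hex(bmp_units[1]))
print(hex(astral_units[1]))
```

Under that assumption the first sentence has a space at index 1, because U+2124 needs only one code unit, while the second has a low surrogate there.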

java unicode utf-16 surrogate-pairs astral-plane

2013-01-04T03:05:11.280

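0 votes

2 answers

791 views

unicode - How do I use Unicode code points above U+FFFF in Rebol 3 strings, as in Rebol 2?

I know that in Rebol 2 you can't use caret-style escapes in strings for code points greater than ^(FF), because it knows nothing about Unicode. So this doesn't produce anything useful; it just looks messy:

Yet the code works in Rebol 3 and prints:

That's great, but R3 evidently maxes out its ability to hold a character in a string at U+FFFF:

This is much better than Rebol 2's random behavior when it encounters code points it doesn't know about. However, if you knew how to do your own UTF-8 encoding (or got the strings by loading source code from disk), there used to be a workaround in Rebol for storing such strings: you could assemble them from individual characters.

So the UTF-8 encoding of U+010000 is #F0908080, and you could previously say:

You would get a string holding a single code point encoded in UTF-8, which you could save to disk in a code block and read back in. Is there a similar trick in R3?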
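The byte sequence quoted for U+010000 can be checked by hand. This Python sketch applies the standard four-byte UTF-8 layout for supplementary-plane code points; it is not Rebol code, and the function name is an assumption.

```python
def utf8_encode_astral(cp):
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a supplementary-plane code point")
    return bytes([
        0xF0 | (cp >> 18),            # lead byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # three continuation bytes, 6 bits each
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


encoded = utf8_encode_astral(0x10000)
print(encoded.hex().upper())
```

The result agrees with the #F0908080 value given in the question.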

unicode rebol rebol3 astral-plane rebol2

2013-02-25T22:44:30.477

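0 votes

3 answers

8317 views

c# - How do I display extended Unicode characters in a C# console?

I'm trying to display a set of playing cards, whose Unicode values are in the range 1F0A0 to 1F0DF. Whenever I try to use a character whose code has more than 4 digits, I get an error. Is it possible to use these characters in this context? I'm using Visual Studio 2012.

char AceOfSpades = '\u1F0A0'; gives me the error "Too many characters in character literal" as soon as I type it, and the error persists with both Unicode and UTF8 encoding. If I try to display '\u1F0A' as above... with Unicode it shows '?', and with UTF8 it shows 3 characters.

With string AceOfSpades = "\U0001F0A0"; I tried all the given options for OutputEncoding. Default, Unicode, ASCII: ??. UTF7: +2DzcoA-. UTF8: four weird characters. UTF32, BigEndianUnicode: IOException. Console.OutputEncoding = System.Text.Encoding.UTF32; crashes even when it is the only line of code, despite being offered as an option. UTF16 is not in the list.

How do I check which version of Unicode I'm using?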
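The UTF7 output quoted above is a useful clue, because UTF-7 carries the big-endian UTF-16 bytes as unpadded base64 between '+' and '-'. This Python sketch splits U+1F0A0 into its surrogate pair and rebuilds that payload. The helper name is an assumption, and nothing here touches the console itself.

```python
import base64


def surrogate_pair(cp):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = surrogate_pair(0x1F0A0)
print(hex(high), hex(low))

# Big-endian UTF-16 bytes of the pair, then base64 without '=' padding.
utf16_be = high.to_bytes(2, "big") + low.to_bytes(2, "big")
payload = base64.b64encode(utf16_be).decode("ascii").rstrip("=")
print("+" + payload + "-")
```

The rebuilt payload matches the reported +2DzcoA-, which suggests the string held the right two chars and that the problem lies on the display side.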

c# unicode astral-plane

2013-03-01T23:24:15.447

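0 votes

2 answers

1108 views

c# - Unicode SMP "characters" in a C# char

I'm trying to determine the impact of character encodings on a software system I'm planning, and I found something odd while testing.

As far as I know, C# uses UTF-16 internally, which (as far as I know) encodes every Unicode code point using two 16-bit fields. So I wanted to make some character literals, and I deliberately chose one character from the SMP and another, 얤, from the BMP. The result is:

What's going on here?

A corollary of this question: if I have the string "얤얤", it displays correctly in a MessageBox, but when I convert it to a char[] with ToCharArray, I get an array with four elements instead of three. Also, String.Length is reported as four rather than three.

Am I missing something here?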
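The three-versus-four discrepancy can be reproduced by counting code points and UTF-16 code units separately. The string in this Python sketch is a hypothetical stand-in (a BMP character, an SMP character, a BMP character), not the asker's actual input.

```python
# Hypothetical stand-in: U+AC00 (BMP), U+10400 (SMP), U+AC00 (BMP).
text = "\uAC00\U00010400\uAC00"

# Python 3 strings count code points.
code_point_count = len(text)

# UTF-16 code units, which is what a 16-bit char array holds.
utf16_unit_count = len(text.encode("utf-16-le")) // 2

# Code units needed per character: 1 in the BMP, 2 outside it.
units_per_char = [len(ch.encode("utf-16-le")) // 2 for ch in text]

print(code_point_count, utf16_unit_count, units_per_char)
```

UTF-16 uses one 16-bit unit for BMP code points and two only for code points above U+FFFF, so a length of four for three visible characters is consistent with the string containing one SMP character.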

c# character-encoding astral-plane

user800576

2013-05-10T15:48:39.183

0 votes

1 answer

7807 views

javascript - Remove Unicode characters within various ranges in javascript

I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.

As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.

In this case, the character seems to have been replaced fine.

However, when I replace that with

I see something unexpected. My output shows up as:

he�llo worl᷿fd is replaced with

There are two things to note here:

\u1dfff does not show up as one character - \u1dff gets converted to a character and the f at the end is treated as its own character
the result is an empty string.

Any suggestions on how I can accomplish this would be much appreciated.

EDIT

My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing it to encodeURIComponent, causes the latter function to raise an exception.

I edited parts of the question after @Blender pointed out that I was using x instead of u in my code to represent Unicode characters.

EDIT 2

I investigated my technique for fetching the "invalid" Unicode ranges further, and it turns out that if you give String.fromCharCode a number larger than 16 bits, it just looks at the lowest 16 bits of the number. That explains the pattern I was seeing, so I only need to worry about the first range.
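The truncation described in EDIT 2 is easy to check numerically. This Python sketch mimics keeping the lowest 16 bits of 0x1DFFF and shows that what remains is a lone surrogate, which cannot be encoded as UTF-8 on its own. The helper names are assumptions, and the link to the encodeURIComponent exception is a plausible reading rather than something confirmed by the question.

```python
def truncate_to_16_bits(n):
    """Keep only the lowest 16 bits, as the question reports for values above 0xFFFF."""
    return n & 0xFFFF


def is_surrogate(unit):
    """True for UTF-16 surrogate code units (U+D800..U+DFFF)."""
    return 0xD800 <= unit <= 0xDFFF


unit = truncate_to_16_bits(0x1DFFF)
print(hex(unit), is_surrogate(unit))

# A lone surrogate has no valid UTF-8 encoding.
try:
    chr(unit).encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

Seen this way, the "invalid" values found above 0xFFFF are the surrogate block repeated every 0x10000, which fits the conclusion that only the first range matters.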

javascript regex unicode replace astral-plane

2013-06-02T02:27:25.313

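0 votes

1 answer

1750 views

c# - Converting a string to its code points

I have to convert a large number of characters to their Unicode code point equivalents. I'm using the following code for the conversion:

This works for the more ordinary characters, but I have characters like ǎ, where the actual string contains 2 chars: a (U-0061) and '̌' (U-030C). The function ConvertToUtf32(string, int) only returns the first char (or the other one, depending on the index), where I actually expect U-0103. Using ConvertToUtf32(char, char) doesn't work, because it is meant for characters with higher code points.

Is there another function I can use to convert strings to their code points, or a calculation I can perform?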
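A base letter followed by a combining mark is a matter of normalization, not of surrogate pairs, so one possible direction is to compose the string (NFC) before reading its code points. This Python sketch uses the standard unicodedata module. In .NET, the analogous step would presumably be string.Normalize with NormalizationForm.FormC, which is an assumption about what fits the asker's code.

```python
import unicodedata

decomposed = "a\u030c"  # 'a' followed by U+030C COMBINING CARON
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in decomposed])
print([f"U+{ord(ch):04X}" for ch in composed])
```

For this particular pair, NFC yields the single code point U+01CE. Sequences with no precomposed form stay as several code points even after normalization.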

c# .net unicode astral-plane

user97462

2013-07-23T07:43:03.590

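0 votes

2 answers

356 views

python - Python semantics of Unicode ranges involving the astral planes

What exactly are the intended semantics of a character range in a regular expression if one or both endpoints of the range lie outside the BMP? I've observed that the following input behaves differently in Python 2.7 and 3.5:

On my 2.7 I get False, and on 3.5 I get True. The latter makes sense to me. The former might be because \U00021111 is represented by the surrogate pair \ud844\udd11, but even then I don't understand it, since \u1000-\ud844 should still include \u1234 just fine.

Is this specified somewhere?
Is this intended behavior?
Does this depend only on the Python version, or also on compile-time flags regarding UTF-16 vs. UTF-32?
Is there a way to get consistent behavior without distinguishing between cases?
If distinguishing between cases is unavoidable, what are the conditions?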
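The Python 3 side of the comparison can be sketched as follows. The pattern is a hypothetical reconstruction built from the code points mentioned in the question, since the original snippet is not reproduced in this listing. On Python 3.3 and later, strings are sequences of code points, so the range is interpreted by code point value.

```python
import re
import sys

# Hypothetical pattern: a range from a BMP code point to an astral one.
pattern = re.compile("[\u1000-\U00021111]")

in_range_bmp = pattern.match("\u1234") is not None
in_range_astral = pattern.match("\U00020000") is not None
below_range = pattern.match("\u0fff") is not None

# sys.maxunicode is 0x10FFFF on Python 3.3+; narrow Python 2 builds reported 0xFFFF.
print(hex(sys.maxunicode), in_range_bmp, in_range_astral, below_range)
```

Checking sys.maxunicode is one way to tell a narrow (UTF-16) build from a wide one, which bears on the question about compile-time flags.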

python regex unicode surrogate-pairs astral-plane

2016-04-21T08:05:35.783

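0 votes

2 answers

759 views

javascript - How do I iterate over only the characters in a string that I can actually see?

Normally I'd use something like str[i].

But what if str = "☀️"?

str[i] fails. for (x of str) console.log(x) fails too. It prints 4 characters in total, even though there are clearly only 2 emoji in the string.

What's the best way to iterate over every character I can see in the string (and, I guess, newlines too), and nothing else?

The ideal solution would return an array of 2 characters: the 2 emoji, and nothing else. The claimed duplicate, and many other solutions I've found, don't meet this criterion.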
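What the question calls a visible character is close to a grapheme cluster: a base code point plus whatever marks or variation selectors attach to it. The Python sketch below is a deliberate simplification that only glues combining marks and variation selectors (U+FE00..U+FE0F) onto the preceding code point. It does not implement the full Unicode segmentation rules (no ZWJ sequences, flags, or skin-tone modifiers), and both the function name and the sample string are assumptions.

```python
import unicodedata


def visible_chunks(text):
    """Group code points so marks and variation selectors stay with their base."""
    chunks = []
    for ch in text:
        attaches = (
            unicodedata.combining(ch) != 0      # combining marks
            or 0xFE00 <= ord(ch) <= 0xFE0F      # variation selectors
        )
        if attaches and chunks:
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks


# Hypothetical input: sun and cloud, each followed by U+FE0F.
text = "\u2600\ufe0f\u2601\ufe0f"
chunks = visible_chunks(text)
print(len(text), len(chunks))
```

Iterating by code point gives four items for this input, while the grouped version gives two, which is the shape of result the question asks for.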

javascript unicode surrogate-pairs astral-plane

2016-04-22T04:40:29.877
