“surrogate-pairs”的相关标签问题

0 投票

2 回答

115 浏览

c# - xUnit.net：为什么这两个等效测试的结果不同？

由于某种原因，此测试InlineData在 xUnit 中使用失败：

而使用MemberData的 this 通过：

这是什么原因？我在 xUnit.net 中发现了一个错误吗？（我认为这可能与它\uD800是一个代理字符这一事实有关，并且在通过时它以某种方式被翻译成 2 个字符InlineData。但不知道为什么。）

c#.net xunit xunit.net surrogate-pairs

2016-03-19T17:22:54.343

0 投票

2 回答

356 浏览

python - 涉及星体平面的 unicode 范围的 Python 语义

如果范围的一个或两个端点都在 BMP 之外，那么正则表达式中字符范围的预期语义到底是什么？我观察到以下输入在 Python 2.7 和 3.5 中的行为不同：

在我的 2.7 中，我得到False了，在 3.5 中我得到了True。后者对我来说很有意义。前者可能是由于\U00021111由代理对表示\ud844\udd11，但即使那样我也不明白，因为\u1000-\ud844应该包含\u1234就好了。

这是在某处指定的吗？
这是预期的行为吗？
这仅取决于 Python 版本，还是取决于有关 UTF-16 与 UTF-32 的编译时标志？
有没有办法在不区分大小写的情况下获得一致的行为？
如果区分大小写是不可避免的，那么条件是什么？

python regex unicode surrogate-pairs astral-plane

2016-04-21T08:05:35.783

0 投票

2 回答

759 浏览

javascript - 如何仅迭代我实际可以看到的字符串中的字符？

通常我会使用类似的东西str[i]。

但万一str = "☀️"呢？

str[i]失败。for (x of str) console.log(x)也失败了。它总共打印出 4 个字符，即使字符串中显然只有 2 个表情符号。

迭代我在字符串中可以看到的每个字符（我猜还有换行符）的最佳方法是什么，仅此而已？

理想的解决方案将返回一个包含 2 个字符的数组：2 个表情符号，仅此而已。声称的副本以及我发现的许多其他解决方案不符合此标准。

javascript unicode surrogate-pairs astral-plane

2016-04-22T04:40:29.877

0 投票

2 回答

15575 浏览

python - 如何在 Python 中将代理对转换为普通字符串？

这是Converting to Emoji的后续内容。在那个问题中，OP 有一个json.dumps()-encoded 文件，其中的表情符号表示为代理对 - \ud83d\ude4f。他/她在读取文件和正确翻译表情符号时遇到问题，正确答案是json.loads()文件中的每一行，json模块将处理从代理对转换回（我假设是 UTF8 编码的）表情符号。

所以这是我的情况：假设我只有一个普通的 Python 3 unicode 字符串，其中有一个代理对：

如何处理此字符串以从中获取表情符号的表示形式？我正在寻找这样的东西：

我试过了：

通常我会收到类似于UnicodeEncodeError: XXX codec can't encode character '\ud83d' in position 8: surrogates no allowed.

我在 Linux 上运行 Python 3.5.1，$LANG设置为en_US.UTF-8. 我已经在命令行的 Python 解释器和在 Sublime Text 中运行的 IPython 中运行了这些示例——似乎没有任何区别。

python python-3.x unicode surrogate-pairs

2016-07-01T13:55:31.177

0 投票

1 回答

1904 浏览

c# - How to decode surrogate characters encoded as UTF8?

My C# program gets some UTF-8 encoded data and decodes it using Encoding.UTF8.GetString(data). When the program that produces the data gets characters outside the BMP, it encodes them as 2 surrogate characters, each encoded as UTF-8 separately. In such cases, my program can't decode them properly.

How can I decode such data in C#?

Example:

Note: The encoding program is written in C++, and converts the data using std::codecvt_utf8<wchar_t> (code below). As @PeterDuniho's answer correctly notes, it should've used std::codecvt_utf8_utf16<wchar_t>. Unfortunately, I don't control this program, and can't change its behavior - only handle its malformed input.

c#c++unicode utf-8 surrogate-pairs

2016-07-10T15:16:20.587

0 投票

2 回答

3737 浏览

python - Python：从非 BMP unicode char 中查找等效代理对

这里给出的答案：如何在 Python 中使用代理对？告诉您如何将代理对转换'\ud83d\ude4f'为单个非 BMP unicode 字符（答案是"\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')）。我想知道如何反向执行此操作。我如何使用 Python 从非 BMP 字符中找到等效的代理对，将'\U0001f64f'() 转换回'\ud83d\ude4f'. 我找不到明确的答案。

python unicode encoding emoji surrogate-pairs

2016-10-24T16:13:24.767

0 投票

1 回答

779 浏览

java - Eclipse IDE processing emojis using surrogate pairs

I am not able to find a clear answer to this. Does the ECLIPSE IDE support emojis? I have read a lot about surrogate pairs here on stack overflow, but I am unable to get a clear answer on this.

I am having to read in a text file character by character and I am using FileInputStream.

Would it be possible to process the emojis using surrogate pairs? I am wanting to use a select few apple emojis. These specifically: By process them, I mean I would like to identify them as that particular emoji when reading in the file.

If so, could someone show me an example?

java eclipse surrogate-pairs

2016-11-01T19:29:52.573

0 投票

2 回答

2126 浏览

java - 如何生成包含补充字符的随机 Unicode 字符串？

我正在研究一些用于生成随机字符串的代码。结果字符串似乎包含无效char组合。具体来说，我发现高代理项后面没有低代理项。

谁能解释为什么会这样？我是否必须明确生成随机低代理来跟随高代理？我以为这不是必需的，因为我使用的int是Character。

这是测试代码，在最近的一次运行中产生了以下错误配对：

java unicode surrogate-pairs

2016-12-12T21:28:44.247

0 投票

0 回答

298 浏览

java - Java Xml 转换转义代表补充字符的代理代码单元

我正在 servlets Tomcat 8.0 的容器中执行一个 Web 应用程序。在请求中，我尝试使用下面的代码将输入数据转换为 XML。第一个输入数据字符是一个unicode补充字符U+16980，表示为字符对\ud81a\udd80，第二个字符是另一个补充字符U+16990，表示为字符对\ud81a\udd90。

我期待：<root><sofa>𖦀 𖦐 � �</sofa></root>

但相反，我得到：<root><sofa>&#55322;&#56704; &#55322;&#56720; � �</sofa> </root>

java xml surrogate-pairs

2016-12-26T23:22:26.050

0 投票

0 回答

2404 浏览

c++ - 检查 UTF-8 字符串在现代 C++ 中是否有效

众所周知，C++11 的标准库允许轻松地将字符串从 UTF-8 编码转换为 UTF-16。但是，以下代码成功地转换了无效的 UTF-8 输入（至少在 MSVC2010 下）：

这里的字符串包含 9 个字节，3 个代码点。最后一个代码点是 0xDB8D，它是无效的（适合代理项的范围）。

是否可以仅使用现代 C++ 标准库检查 UTF-8 字符串的完美有效性？在这里，我的意思是不允许维基百科文章中描述的所有无效案例。

c++utf-8 surrogate-pairs

2017-01-14T17:28:55.570

问题标签 [surrogate-pairs]

Reference