I have a strange problem that I can't explain. I'm trying to manipulate a string that contains an accented character such as "é". The string comes from the name of an image selected through an input of type file.

What I can't understand is why, when I go through my string with a for loop, the accented character is split into two characters. Here is an example to make it clearer:

My é is split into two characters, like this: e and ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!

2 Answers

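They are called combining diacritical marks. They are a "piece" of Unicode: combinable diacritics that can be "chained" onto any character. Clearly the length of the string in this case is 2 (because there is the e and the ́). Precomposed characters such as àéèìòù are kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know about this, and 99.9% of programs support it very badly. I'm quite sure they can be used as an attack vector somewhere (but I'm not paranoid :-))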
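To make the difference between the decomposed and precomposed forms concrete, here is a minimal sketch. It assumes an ES2015+ runtime where `String.prototype.normalize` is available; the variable names are only illustrative, and whether normalizing is appropriate for a file name depends on what you do with it afterwards.

```javascript
// "é" built two ways: decomposed (base letter + combining mark) and precomposed.
const decomposed = "e\u0301";  // U+0065 followed by U+0301 COMBINING ACUTE ACCENT
const precomposed = "\u00e9";  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// Marks can be chained: "e" with an acute accent and a dot below.
const chained = "e\u0301\u0323";

// normalize("NFC") composes where a precomposed character exists;
// normalize("NFD") splits it back into base letter + marks.
const composed = decomposed.normalize("NFC");

console.log(decomposed.length);                    // 2
console.log(composed.length);                      // 1
console.log(composed === precomposed);             // true
console.log(chained.length);                       // 3
console.log(precomposed.normalize("NFD").length);  // 2
```

If the two forms need to compare equal, normalizing both sides to the same form before comparing is one possible approach.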

I'll even add that in 2009 even Skeet wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

"You see, I can't remember whether combining characters come before or after the base character"

:-) :-)

answered 2013-09-02T17:26:41.273

Rather than UTF-8, it is more likely that combining diacritical marks are involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

JavaScript strings are usually encoded as UTF-16, so a single precomposed "é" (U+00E9) fits entirely in 1 code unit and has a length of 1.
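To check which form a given string actually contains, you can print its UTF-16 code units. This is only a sketch: `codeUnits` is a hypothetical helper name, not a built-in, and it relies on `padStart` (ES2017).

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as hex values.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
    units.push("0x" + hex);
  }
  return units;
}

console.log(codeUnits("\u00e9"));   // [ '0x00E9' ]           precomposed: 1 code unit
console.log(codeUnits("e\u0301"));  // [ '0x0065', '0x0301' ] decomposed: 2 code units
```

Running it on the file name from the question should show whether the "é" arrives as 0x00E9 or as 0x0065 followed by 0x0301.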


But characters outside the BMP (those whose code points are above U+FFFF) will return 2, because they are encoded as 2 UTF-16 code units.

>>> "\uD83D\uDE00".length  // U+1F600, an example character outside the BMP
2
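The two effects are independent, which the following sketch tries to show. It uses U+1F600 purely as an example of a code point above U+FFFF (written as its surrogate pair) and assumes ES2015+ for `Array.from` and `codePointAt`.

```javascript
// U+1F600 written as its UTF-16 surrogate pair.
const astral = "\uD83D\uDE00";

console.log(astral.length);                       // 2 (UTF-16 code units)
console.log(Array.from(astral).length);           // 1 (code points)
console.log(astral.codePointAt(0).toString(16));  // "1f600"

// Counting code points does not merge combining marks:
// "e" + U+0301 is still two code points.
console.log(Array.from("e\u0301").length);        // 2
```

So iterating by code point handles surrogate pairs, but a decomposed "é" still counts as two unless the string is normalized first.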
answered 2013-09-02T17:29:32.797