javascript - Strange length of accent as "é" string return 2

Question

I have a strange problem that I can't explain. I'm trying to manipulate a string that contains an accented character such as "é". The string comes from the name of an image selected through an input of type file.

What I can't understand is why, when I iterate over the string, the accented character is split into two characters. Here is an example to illustrate:

My é is divided into two characters, like this: e & ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!

score 11 · Accepted Answer

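They are called Combining Diacritical Marks. They are a "piece" of Unicode: combinable diacritics that can be "chained" onto any character. Clearly the length of the string in this case is 2 (because it contains the e and the combining accent). Precomposed characters such as àéèìòù are kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know about this, and 99.9% of programs support it very badly. I'm quite sure they can be used as an attack vector somewhere (but I'm not paranoid :-) )

I'll even add that Skeet himself, in 2009, wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

"You see, I can't remember whether combining characters come before or after the base character"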

:-) :-)
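To make the precomposed versus combining distinction concrete, here is a minimal sketch using the standard String.prototype.normalize method. The variable names are only illustrative. It assumes the string in the question arrived in decomposed form (base letter followed by U+0301), which the reported length of 2 suggests. Note that the combining mark comes after its base character.

```javascript
// Two ways to encode the same visible character.
const precomposed = "\u00e9"; // single code point U+00E9
const decomposed = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

console.log(precomposed.length);         // 1
console.log(decomposed.length);          // 2
console.log(precomposed === decomposed); // false: different code unit sequences

// NFC composes a base + mark pair when a precomposed form exists;
// NFD does the opposite and splits it back apart.
const composed = decomposed.normalize("NFC");
console.log(composed.length);                             // 1
console.log(composed === precomposed);                    // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Normalizing to NFC before measuring or comparing is one option when the input may come from different sources. It only helps for accents that have a precomposed form; other combinations stay as several code points.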

score 8 · Accepted Answer

Rather than UTF-8, it's more likely that combining diacritical marks are involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

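JavaScript strings are usually encoded as UTF-16, so a single precomposed "é" (U+00E9) fits entirely in 1 code unit.

But characters outside the BMP (those with code points beyond U+FFFF) will return 2, because they are encoded as 2 UTF-16 code units.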

>>> "\uD83D\uDE00".length
2
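The two cases above (combining marks and surrogate pairs) can be told apart by counting the same string three ways. This is a rough sketch, not a definitive recipe: the helper names are hypothetical, and it assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js versions).

```javascript
// Hypothetical helpers: count UTF-16 code units, code points, and grapheme clusters.
function countCodeUnits(str) {
  return str.length; // .length counts UTF-16 code units
}

function countCodePoints(str) {
  return Array.from(str).length; // the string iterator walks code points
}

function countGraphemes(str) {
  // Grapheme clusters approximate what a reader perceives as one character.
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return Array.from(segmenter.segment(str)).length;
}

const accented = "e\u0301";     // base letter + combining acute accent
const astral = "\uD83D\uDE00";  // U+1F600 written as a surrogate pair

console.log(countCodeUnits(accented), countCodePoints(accented), countGraphemes(accented)); // 2 2 1
console.log(countCodeUnits(astral), countCodePoints(astral), countGraphemes(astral));       // 2 1 1
```

Counting code points fixes the surrogate pair case but not the combining mark case; only grapheme segmentation (or prior NFC normalization, where a precomposed form exists) reports 1 for both.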
