TomC recommends decomposing Unicode characters on input and recomposing them on output.
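As a rough illustration of that advice, here is a minimal sketch in Python (chosen only for illustration; Perl's Unicode::Normalize module exports NFD and NFC functions that play the same role). The helper names `read_text` and `write_text` are hypothetical, not part of any library.

```python
import unicodedata


def read_text(raw: str) -> str:
    """Hypothetical input hook: decompose (NFD) so all internal text has one form."""
    return unicodedata.normalize("NFD", raw)


def write_text(text: str) -> str:
    """Hypothetical output hook: recompose (NFC) before handing text to other tools."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00c9ric"   # "Éric" with a single precomposed É (U+00C9)
decomposed = "E\u0301ric"   # "Éric" as E followed by COMBINING ACUTE ACCENT

# Both spellings become identical once decomposed on input.
internal_a = read_text(precomposed)
internal_b = read_text(decomposed)
same_internally = internal_a == internal_b

# On output, either spelling is written back in precomposed form.
output = write_text(internal_b)
```

With this arrangement the program accepts either form of the input but always emits one form.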
To just display a decomposed character, the rendering software needs to deal with combining diacritic marks. It does not suffice to find them in the font. The renderer needs to position the diacritic properly, using information about the dimensions of the base character. There are often problems with this, resulting in poor rendering—especially if the rendering uses the diacritic from a different font! The result can hardly be better than what is achieved by simply displaying the glyph of a precomposed character like “é”, designed by a typographer.
(Rendering software can also analyze the situation and effectively map the decomposed character to a precomposed character. But that would require extra code.)
It's simple: most tools have limited Unicode support; they assume characters are in NFC form.
perl -CSDA -e'use utf8; if ($ARGV[0] eq "Éric") { ... }'
Of course, the "É" is in NFC form (since that's what almost everything produces), so this program only accepts arguments in NFC form.
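The same pitfall can be sketched outside Perl. The following Python snippet is only an illustration, with hypothetical function names: a comparison against an NFC literal rejects a decomposed argument unless the argument is normalized first.

```python
import unicodedata

EXPECTED = "\u00c9ric"  # NFC literal, like the "Éric" in the one-liner above


def matches_naive(arg: str) -> bool:
    """Plain comparison: only succeeds if arg is already in NFC."""
    return arg == EXPECTED


def matches_normalized(arg: str) -> bool:
    """Normalize the argument to NFC before comparing."""
    return unicodedata.normalize("NFC", arg) == EXPECTED


nfd_arg = "E\u0301ric"  # same name, but E + COMBINING ACUTE ACCENT

naive_result = matches_naive(nfd_arg)            # False: code points differ
normalized_result = matches_normalized(nfd_arg)  # True: equal after NFC
```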
You should pick one normalization form so that all the data have the same normalization, so why not choose the potentially shorter one?
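To make "potentially shorter" concrete, here is a small Python check on a single word. It only shows the sizes for this one example and is not a general claim about every string.

```python
import unicodedata

word = "\u00c9ric"  # "Éric"

nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# Length in code points.
nfc_chars = len(nfc)
nfd_chars = len(nfd)

# Length in bytes once encoded as UTF-8.
nfc_bytes = len(nfc.encode("utf-8"))
nfd_bytes = len(nfd.encode("utf-8"))
```

For this word, the NFC form takes 4 code points and 5 UTF-8 bytes, while the NFD form takes 5 code points and 6 bytes.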
As for someone else's decomposition, remember that you want to be strict with what you output but liberal with what you accept. :)
Tom Christiansen is an active participant on StackOverflow and answers a lot of Perl questions. There's a good chance he'll answer this question.
Certain character sequences such as ff can be represented in UTF-8 as either two Unicode characters, f and f, or as a single Unicode character (the ligature ﬀ). When you decompose your characters, you're making things like ﬀ become two separate characters, which would be important for sorting. You want this to be two separate letters f when you sort.
When you recompose f and f, they go back to the single character, which would be important for displaying (you want them to format nicely) and for editing (you want to edit it as a single character).
Unfortunately, my theory falls apart with things like the Spanish ñ. This is represented as the single character U+00F1 and decomposes into U+006E (n) and U+0303 (a combining tilde). Maybe Perl has logic built in to handle this kind of two-character decomposed representation.
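One way to test the theory is to run both examples through a normalizer. This is a sketch using Python's `unicodedata` module (an assumption for illustration; the thread itself is about Perl), with the ligature taken to be U+FB00. It suggests the two cases behave differently: ñ round-trips through canonical decomposition and composition, while the ligature is only split by the compatibility forms and is not put back together afterwards.

```python
import unicodedata

n_tilde = "\u00f1"   # ñ, LATIN SMALL LETTER N WITH TILDE
ligature = "\ufb00"  # LATIN SMALL LIGATURE FF

# Canonical decomposition and recomposition of ñ.
nfd_n = unicodedata.normalize("NFD", n_tilde)  # "n" + U+0303
nfc_n = unicodedata.normalize("NFC", nfd_n)    # back to U+00F1

# The ligature has no canonical decomposition, so NFD leaves it alone.
nfd_lig = unicodedata.normalize("NFD", ligature)

# Compatibility decomposition splits it into two plain letters f.
nfkd_lig = unicodedata.normalize("NFKD", ligature)

# Composing again does not rebuild the ligature.
nfkc_back = unicodedata.normalize("NFKC", nfkd_lig)
```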