perl - Why recompose Unicode (NFC) on the way out?

Question

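TomC recommends decomposing Unicode characters on the way in and recomposing them on the way out ( http://www.perl.com/pub/2012/04/perl-unicode-cookbook-always-decompose-and-recompose.html ).

The former makes perfect sense to me, but I can't see why he recommends recomposing on the way out. If your text contains a lot of European accented characters you might save a small amount of space, but you're just pushing the work onto someone else's decomposition function.

Is there some other obvious reason I'm missing?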
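For reference, the pattern being asked about looks roughly like this. It is only a minimal sketch using the core Unicode::Normalize module; the helper names `from_outside` and `to_outside` are made up for illustration and do not come from the cookbook.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helpers; the names are illustrative only.
sub from_outside { return NFD($_[0]) }   # decompose on the way in
sub to_outside   { return NFC($_[0]) }   # recompose on the way out

# Render a string as a list of code points, e.g. "U+0065 U+0301".
sub codepoints { return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

my $input    = "\N{U+00E9}";             # precomposed e-acute
my $internal = from_outside($input);     # "e" followed by U+0301
my $output   = to_outside($internal);    # back to U+00E9

print "internal: ", codepoints($internal), "\n";
print "output:   ", codepoints($output), "\n";
```

Internally the program works on the decomposed string; only the final output is recomposed.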

score 5 · Accepted Answer

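As 文达苏 wrote in a comment, there is software that can deal with composed characters but not with decomposed ones. Although the opposite is theoretically possible too, I have never seen it in practice and expect it to be rare.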

To just display a decomposed character, the rendering software needs to deal with combining diacritic marks. It does not suffice to find them in the font. The renderer needs to position the diacritic properly, using information about the dimensions of the base character. There are often problems with this, resulting in poor rendering—especially if the rendering uses the diacritic from a different font! The result can hardly be better than what is achieved by simply displaying the glyph of a precomposed character like “é”, designed by a typographer.

(Rendering software can also analyze the situation and effectively map the decomposed character to a precomposed character. But that would require extra code.)

score 2 · Accepted Answer

It's simple: most tools have limited support for Unicode; they assume characters are in NFC form.

For example, this is typically how people compare strings:

perl -CSDA -e'use utf8; if ($ARGV[0] eq "Éric") { ... }'

Of course, the "É" is in NFC form (since that's what almost everything produces), so this program only accepts arguments in NFC form.
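To make the failure concrete, here is a small sketch that builds both forms of the name explicitly with code point escapes (so it does not depend on how an editor or shell encodes "É") and compares them with and without normalization. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $literal    = "\N{U+00C9}ric";    # "Eric" with a precomposed E-acute (NFC)
my $decomposed = "E\N{U+0301}ric";   # same text as E + combining acute (NFD)

# Plain eq compares code point by code point, so the two forms differ.
my $naive = $decomposed eq $literal ? 1 : 0;

# Normalizing both sides to the same form makes them comparable.
my $normalized = NFC($decomposed) eq NFC($literal) ? 1 : 0;

print "naive eq:       ", ($naive      ? "match" : "no match"), "\n";
print "after NFC both: ", ($normalized ? "match" : "no match"), "\n";
```

A tool that skips the normalization step behaves like the naive comparison, which is why handing it NFC output is the safer bet.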

score 0 · Accepted Answer

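It makes things like text editors simpler, because the end user expects one visible character to be a single character rather than several. It also prevents problems with systems that don't treat a decomposed character as a "single" character.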
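A rough illustration of the "one visible character" point, assuming a Perl recent enough to support `\X` for grapheme clusters: `length` counts code points, so the decomposed form looks one character longer than what the user sees.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $nfc = NFC("caf\N{U+00E9}");      # "cafe" with a precomposed e-acute
my $nfd = NFD($nfc);                 # "cafe" followed by U+0301

my $nfc_len   = length $nfc;         # code points in the composed form
my $nfd_len   = length $nfd;         # code points in the decomposed form
my $graphemes = () = $nfd =~ /\X/g;  # user-perceived characters in the NFD string

print "NFC length: $nfc_len\n";
print "NFD length: $nfd_len\n";
print "NFD grapheme clusters: $graphemes\n";
```

Software that indexes or moves a cursor by code point gets the intuitive answer only for the NFC string.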

Beyond that, I don't see any particular advantage.

score 0 · Accepted Answer

You should pick one normalization form so that all the data have the same normalization, so why not choose the potentially shorter one?
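As a quick check of "potentially shorter", this sketch compares the UTF-8 byte length of the same text in both forms, using the core Encode module. The sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

my $text = "\N{U+00C9}ric";          # "Eric" with E-acute

# Precomposed E-acute takes 2 bytes in UTF-8.
my $nfc_bytes = length encode('UTF-8', NFC($text));

# "E" takes 1 byte and the combining acute (U+0301) takes 2.
my $nfd_bytes = length encode('UTF-8', NFD($text));

print "NFC: $nfc_bytes bytes, NFD: $nfd_bytes bytes\n";
```

The saving is one byte per accented letter here, so it only adds up for heavily accented text.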

As for someone else's decomposition, remember that you want to be strict with what you output but liberal with what you accept. :)

score -3 · Accepted Answer

Tom Christiansen is an active participant on StackOverflow and answers a lot of Perl questions. There's a good chance he'll answer this question.

Certain character sequences such as ff can be represented in UTF-8 as either two Unicode characters f and f, or as a single Unicode character (ﬀ). When you decompose your characters, you're making things like ﬀ become two separate characters, which would be important for sorting. You want this to be two separate letters f when you sort.

When you recompose UTF-8 f and f, they go back to the single UTF-8 character which would be important for displaying (you want them to format nicely) and for editing (you want to edit it as a single character).

Unfortunately, my theory falls apart with things like the Spanish ñ. This is represented as the single character U+00F1, and decomposes into U+006E (n) and U+0303 (combining ~). Maybe Perl has logic built in to handle this kind of two-character decomposed representation.
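Both cases can be checked directly with Unicode::Normalize. This is only a sketch (the `codepoints` helper is made up for display), but as far as I know the canonical forms NFD and NFC leave the ﬀ ligature (U+FB00) alone, and only the compatibility forms such as NFKD split it, while ñ round-trips through NFD and NFC as described.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKD);

# Hypothetical display helper: list the code points of a string.
sub codepoints { return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

my $n_tilde  = "\N{U+00F1}";         # n with tilde
my $ligature = "\N{U+FB00}";         # LATIN SMALL LIGATURE FF

print "NFD(n-tilde):      ", codepoints(NFD($n_tilde)), "\n";
print "NFC(NFD(n-tilde)): ", codepoints(NFC(NFD($n_tilde))), "\n";
print "NFD(ligature):     ", codepoints(NFD($ligature)), "\n";
print "NFKD(ligature):    ", codepoints(NFKD($ligature)), "\n";
print "NFC(\"ff\"):         ", codepoints(NFC("ff")), "\n";
```

So recomposing with NFC never turns two plain f characters into the ligature; it only rebuilds canonical pairs such as n plus combining tilde.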
