.net - Can .NET convert Unicode to ASCII to remove "smart quotes" and the like?

Question

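Some of our users use e-mail clients that can't cope with Unicode, even when the encoding and so on are set correctly in the mail headers.

I'd like to "normalize" the content they receive. The biggest problem we have is users copying and pasting content from Microsoft Word into our web application, which then forwards that content by e-mail, including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.

I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that would get me started?

There are basically three phases involved.

First, stripping accents from otherwise normal letters (a solution to this already exists in a linked question). This takes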

This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions

to

This paragraph contains “smart quotes” and accents and ½ of the problem is fractions
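For reference, the accent-stripping phase is commonly done by decomposing the text and dropping the combining marks. The following is only a minimal sketch of that idea, not necessarily the code from the linked solution; the class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class AccentStripper
{
    public static string RemoveAccents(string text)
    {
        // NFD splits "á" into "a" followed by a combining acute accent.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Smart quotes and fractions pass through this step untouched, which is why the later phases are still needed.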

Second, replacing single Unicode characters with their corresponding ASCII character, to give:

This paragraph contains "smart quotes" and accents and ½ of the problem is fractions

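This is the part where I'm hoping a solution exists before I go off and implement my own.

Finally, replacing specific characters with a suitable ASCII sequence (½ to 1/2 and so on). I'm pretty sure no kind of Unicode magic supports this natively, but maybe someone has already written a suitable lookup table that I can reuse.

Any ideas?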

score 24 · Accepted Answer

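Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback"; the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks?"

In other words, we don't need a general-purpose solution; we need a solution that works 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system, looking for characters that can't be represented in ASCII encoding, using the following test: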

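/// <summary>Determine whether the supplied character is
/// representable using ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    // ASCIIEncoding substitutes '?' for anything it cannot encode,
    // so a character survives the round trip only if it is ASCII.
    var ascii = new ASCIIEncoding();
    var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
    return asciiChar == inputChar;
}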
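The survey itself isn't shown above, so here is a rough sketch of how such a tally could be produced. It assumes the messages are available as an in-memory sequence of strings and uses a plain code point comparison instead of the encoder round trip; `NonAsciiSurvey` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, named for illustration only.
public static class NonAsciiSurvey
{
    // Counts every character above U+007F across the supplied messages.
    public static Dictionary<char, int> CountNonAscii(IEnumerable<string> messages)
    {
        var counts = new Dictionary<char, int>();
        foreach (string message in messages)
        {
            foreach (char c in message)
            {
                if (c <= '\u007F') continue;      // plain ASCII, nothing to record
                counts.TryGetValue(c, out int n); // n stays 0 for a new character
                counts[c] = n + 1;
            }
        }
        return counts;
    }

    // Orders the tally so the most common offenders come first.
    public static IEnumerable<KeyValuePair<char, int>> MostFrequent(Dictionary<char, int> counts)
    {
        return counts.OrderByDescending(pair => pair.Value);
    }
}
```

Sorting by frequency is what makes a hand-written table practical: the first few dozen entries cover nearly all of the traffic.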

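I then went through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string to each one. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoded approximation.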

using System;
using System.Collections.Generic;
using System.Linq;

public static class StringExtensions {
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
    /// <summary>Returns the specified string with characters not representable in 7-bit ASCII converted to a suitable representative equivalent.  Yes, this is lossy.</summary>
    /// <param name="s">A string.</param>
    /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
    /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
    public static string Asciify(this string s) {
        return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
    }

    private static string Asciify(char x) {
        // Fall back to the character itself when no replacement is registered.
        return Replacements.TryGetValue(x, out var replacement) ? replacement : x.ToString();
    }

    static StringExtensions() {
        Replacements['’'] = "'"; // 75151 occurrences
        Replacements['–'] = "-"; // 23018 occurrences
        Replacements['‘'] = "'"; // 9783 occurrences
        Replacements['”'] = "\""; // 6938 occurrences
        Replacements['“'] = "\""; // 6165 occurrences
        Replacements['…'] = "..."; // 5547 occurrences
        Replacements['£'] = "GBP"; // 3993 occurrences
        Replacements['•'] = "*"; // 2371 occurrences
        Replacements['\u00A0'] = " "; // non-breaking space; 1529 occurrences
        Replacements['é'] = "e"; // 878 occurrences
        Replacements['ï'] = "i"; // 328 occurrences
        Replacements['´'] = "'"; // 226 occurrences
        Replacements['—'] = "-"; // 133 occurrences
        Replacements['·'] = "*"; // 132 occurrences
        Replacements['„'] = "\""; // 102 occurrences
        Replacements['€'] = "EUR"; // 95 occurrences
        Replacements['®'] = "(R)"; // 91 occurrences
        Replacements['¹'] = "(1)"; // 80 occurrences
        Replacements['«'] = "\""; // 79 occurrences
        Replacements['è'] = "e"; // 79 occurrences
        Replacements['á'] = "a"; // 55 occurrences
        Replacements['™'] = "TM"; // 54 occurrences
        Replacements['»'] = "\""; // 52 occurrences
        Replacements['ç'] = "c"; // 52 occurrences
        Replacements['½'] = "1/2"; // 48 occurrences
        // NOTE: the key in the next entry is an invisible (or lost) character; re-enter it as a \uXXXX escape before compiling.
        Replacements[''] = "-"; // 39 occurrences
        Replacements['°'] = " degrees "; // 33 occurrences
        Replacements['ä'] = "a"; // 33 occurrences
        Replacements['É'] = "E"; // 31 occurrences
        Replacements['‚'] = ","; // 31 occurrences
        Replacements['ü'] = "u"; // 30 occurrences
        Replacements['í'] = "i"; // 28 occurrences
        Replacements['ë'] = "e"; // 26 occurrences
        Replacements['ö'] = "o"; // 19 occurrences
        Replacements['à'] = "a"; // 19 occurrences
        Replacements['¬'] = " "; // 17 occurrences
        Replacements['ó'] = "o"; // 15 occurrences
        Replacements['â'] = "a"; // 13 occurrences
        Replacements['ñ'] = "n"; // 13 occurrences
        Replacements['ô'] = "o"; // 10 occurrences
        Replacements['¨'] = ""; // 10 occurrences
        Replacements['å'] = "a"; // 8 occurrences
        Replacements['ã'] = "a"; // 8 occurrences
        Replacements['ˆ'] = ""; // 8 occurrences
        Replacements['©'] = "(c)"; // 6 occurrences
        Replacements['Ä'] = "A"; // 6 occurrences
        Replacements['Ï'] = "I"; // 5 occurrences
        Replacements['ò'] = "o"; // 5 occurrences
        Replacements['ê'] = "e"; // 5 occurrences
        Replacements['î'] = "i"; // 5 occurrences
        Replacements['Ü'] = "U"; // 5 occurrences
        Replacements['Á'] = "A"; // 5 occurrences
        Replacements['ß'] = "ss"; // 4 occurrences
        Replacements['¾'] = "3/4"; // 4 occurrences
        Replacements['È'] = "E"; // 4 occurrences
        Replacements['¼'] = "1/4"; // 3 occurrences
        Replacements['†'] = "+"; // 3 occurrences
        Replacements['³'] = "'"; // 3 occurrences
        Replacements['²'] = "'"; // 3 occurrences
        Replacements['Ø'] = "O"; // 2 occurrences
        Replacements['¸'] = ","; // 2 occurrences
        Replacements['Ë'] = "E"; // 2 occurrences
        Replacements['ú'] = "u"; // 2 occurrences
        Replacements['Ö'] = "O"; // 2 occurrences
        Replacements['û'] = "u"; // 2 occurrences
        Replacements['Ú'] = "U"; // 2 occurrences
        Replacements['Œ'] = "Oe"; // 2 occurrences
        Replacements['º'] = "?"; // 1 occurrences
        Replacements['‰'] = "0/00"; // 1 occurrences
        Replacements['Å'] = "A"; // 1 occurrences
        Replacements['ø'] = "o"; // 1 occurrences
        Replacements['˜'] = "~"; // 1 occurrences
        Replacements['æ'] = "ae"; // 1 occurrences
        Replacements['ù'] = "u"; // 1 occurrences
        Replacements['‹'] = "<"; // 1 occurrences
        Replacements['±'] = "+/-"; // 1 occurrences
    }
}

Note that there are some rather odd fallbacks in there, like this one:

Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences

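That's because one of our users has some program that converts open/close smart quotes into ² and ³ (as in: he said ²hello³), and nobody has ever used them to represent exponentiation, so this will probably work pretty well for us, but YMMV.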

score 6 · Accepted Answer

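I ran into some problems of my own with this while working with a list of strings originally built in Word. I found that a simple "String".Replace(current char/string, new char/string) call worked well. The exact code I used was for smart quotes, or more precisely left ", right ", left ', and right ', as follows: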

StringName = StringName.Replace(ChrW(8216), "'")     ' Replaces any left ' with a normal '
StringName = StringName.Replace(ChrW(8217), "'")     ' Replaces any right ' with a normal '
StringName = StringName.Replace(ChrW(8220), """")    ' Replace any left " with a normal "
StringName = StringName.Replace(ChrW(8221), """")    ' Replace any right " with a normal "

I hope this helps anyone who is still having this problem!
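Since the rest of the thread is in C#, here is a rough C# equivalent of the VB snippet above. It is a sketch only; `SmartQuotes.Straighten` is a made-up name, and the \u escapes are just the hexadecimal form of the ChrW values.

```csharp
using System;

// Hypothetical helper, named for illustration only.
public static class SmartQuotes
{
    public static string Straighten(string text)
    {
        return text
            .Replace('\u2018', '\'')  // left single quote  (ChrW(8216))
            .Replace('\u2019', '\'')  // right single quote (ChrW(8217))
            .Replace('\u201C', '"')   // left double quote  (ChrW(8220))
            .Replace('\u201D', '"');  // right double quote (ChrW(8221))
    }
}
```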

score 1 · Accepted Answer

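Is there some built-in method that would get me started?

The first thing I would try is converting the text to the NFKD normalization form, using the Normalize method on strings. This suggestion is mentioned in the answer to the question you linked, but I recommend NFKD over NFD because NFKD removes unwanted typographic distinctions (for example, NBSP → space, or ℂ → C).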
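A minimal sketch of the NFKD step might look like this. The `CompatibilityFold` name is made up, and the extra replacement of the fraction slash is my own assumption about what is wanted: NFKD turns ½ into 1, U+2044 (FRACTION SLASH), 2, and U+2044 is still not ASCII.

```csharp
using System;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class CompatibilityFold
{
    public static string ToNfkd(string text)
    {
        // Compatibility decomposition: NBSP becomes a space, ℂ becomes C, and so on.
        string folded = text.Normalize(NormalizationForm.FormKD);
        // Vulgar fractions decompose with U+2044, so map that to a plain slash.
        return folded.Replace('\u2044', '/');
    }
}
```

Note that NFKD leaves smart quotes alone (they have no decomposition) and leaves accents behind as separate combining marks, so it is a starting point, not the whole job.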

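You can also do generic replacements by Unicode category. For example, Pd can be replaced with -, Nd can be replaced with the corresponding 0-9 digit, and Mn can be replaced with the empty string (to strip accents).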
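As a sketch, the three category rules just mentioned could be written as below. It assumes the input has already been decomposed (for example with NFKD), works one UTF-16 char at a time so characters outside the BMP simply pass through, and uses a made-up `CategoryFold` name.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, named for illustration only.
public static class CategoryFold
{
    public static string FoldByCategory(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.DashPunctuation:    // Pd: any dash becomes '-'
                    sb.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber: // Nd: any decimal digit becomes 0-9
                    sb.Append((char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)));
                    break;
                case UnicodeCategory.NonSpacingMark:     // Mn: drop combining accents
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Anything outside these three categories is passed through unchanged, so quotes, bullets, and currency symbols still need a lookup table.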

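but maybe someone has written a suitable lookup table that I can reuse.

You could try using the data from the Unidecode program or from CLDR.

Edit: there is a huge substitution chart here.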

score -1 · Accepted Answer

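You should never try to convert Unicode to ASCII, because you will end up with more problems than you solve.

It's like trying to fit 1,114,112 code points (Unicode 6.0) into 128 characters.

Do you think you will succeed?

By the way, Unicode has a lot of quotation marks, not just the ones you mentioned, and if you still want to do the conversion, keep in mind that it will be locale-dependent.

Check out ICU, which contains the most complete Unicode conversion routines.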
