c# - Why doesn't Đ get flattened to D when removing accents/diacritics

Question

I am using this method to remove accents from my strings:

static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in normalized)
    {
        if (char.GetUnicodeCategory(c) !=
        UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }
    return builder.ToString();
}

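But this method leaves đ as đ and does not change it to d, even though d is its base character. You can try it with this input string: "æøåáâăäĺćççéęëěíîďđńňóôőöřůúűüýţ"

What is so special about the letter đ?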

score 14 · Accepted Answer

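The reason it doesn't work is that the claim "d is its base character" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has the Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it does not decompose into "d" followed by a combining mark).

"đ".Normalize(NormalizationForm.FormD) simply returns "đ", which the loop does not strip because it is not a non-spacing mark.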

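A similar problem exists for "ø" and other letters for which Unicode provides no decomposition mapping. (If you are trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese, or other non-Latin scripts; you will also run into problems if, for example, you want to transliterate "ß" as "ss". Using a library like UnidecodeSharp might help.)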
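One way to handle letters that have no decomposition is to combine the normalization loop with an explicit replacement table. The following is only a minimal sketch under that assumption: the class name `AsciiFolding` and the contents of the `Fallbacks` table are hypothetical and deliberately incomplete, and the ASCII equivalents chosen are a matter of taste rather than anything Unicode defines.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AsciiFolding
{
    // Hypothetical, incomplete table of letters that Unicode does not decompose.
    // The replacements are illustrative choices, not a standard.
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        ['\u0111'] = "d",  // đ
        ['\u0110'] = "D",  // Đ
        ['\u00F0'] = "d",  // ð
        ['\u00D0'] = "D",  // Ð
        ['\u00F8'] = "o",  // ø
        ['\u00D8'] = "O",  // Ø
        ['\u00E6'] = "ae", // æ
        ['\u00C6'] = "AE", // Æ
        ['\u00DF'] = "ss", // ß
    };

    public static string RemoveAccents(string input)
    {
        // Split precomposed letters into base letter + combining marks.
        string normalized = input.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(normalized.Length);

        foreach (char c in normalized)
        {
            // Drop the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Letters without a decomposition are replaced from the table, if listed.
            if (Fallbacks.TryGetValue(c, out string replacement))
                builder.Append(replacement);
            else
                builder.Append(c);
        }

        return builder.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveAccents("\u0111\u00E9\u00F8")); // input is "đéø"
    }
}
```

Anything missing from the table passes through unchanged, so the table has to be extended for every additional letter you care about; a transliteration library avoids maintaining it by hand.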

score 3 · Accepted Answer

I have to admit I am not sure why this works, but it certainly seems to:

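// Note: on .NET Core / .NET 5+, the "Cyrillic" encoding is only available after
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));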

=> "aoaaaaalccceeeeiiddnnooooruuuyt"

score 3 · Accepted Answer

"D with stroke" (Wikipedia) is used in several languages and appears to be treated as a distinct letter in all of them, which is why it stays unchanged.

score 0 · Accepted Answer

string.Normalize(NormalizationForm) is a simple way to remove "real" diacritics (Wiki), but many letters you might want to convert are not affected by it.
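To find out which letters normalization can actually affect, you can compare a string with its Form D normalization. This is a small sketch for illustration only; the class name `DecompositionCheck` and the helper `Decomposes` are made up here and are not part of the framework.

```csharp
using System;
using System.Text;

static class DecompositionCheck
{
    // True when canonical decomposition (Form D) changes the string, i.e. when
    // Unicode defines a decomposition for at least one character in it.
    public static bool Decomposes(string s) =>
        !string.Equals(s, s.Normalize(NormalizationForm.FormD), StringComparison.Ordinal);

    static void Main()
    {
        // é (U+00E9), đ (U+0111), ø (U+00F8), æ (U+00E6), ð (U+00F0)
        foreach (string s in new[] { "\u00E9", "\u0111", "\u00F8", "\u00E6", "\u00F0" })
        {
            Console.WriteLine($"{s}: decomposes = {Decomposes(s)}");
        }
    }
}
```

Of the letters listed, only é is expected to report a decomposition; the others come back from Normalize unchanged, so a loop that strips non-spacing marks has nothing to remove.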

I had similar problems with Ð and ð (the letter Eth), đ, Æ and æ. To convert them to ANSI (Latin), convert the Unicode string to a specific encoding instead!

    private static char[] ConvertUnicodeStringToSpecificEncoding(string input, int resultEncodingCode)
    {
        System.Text.Encoding unicodeEncoding = System.Text.Encoding.Unicode;
        System.Text.Encoding specificEncoding = System.Text.Encoding.GetEncoding(resultEncodingCode);

        byte[] convertedBytes = System.Text.Encoding.Convert(unicodeEncoding, specificEncoding, unicodeEncoding.GetBytes(input));
        char[] convertedChars = new char[specificEncoding.GetCharCount(convertedBytes, 0, convertedBytes.Length)];
        specificEncoding.GetChars(convertedBytes, 0, convertedBytes.Length, convertedChars, 0);
        return convertedChars;
    }

Call this method on the same string with multiple encodings to create an intersection of the letters you want to keep.

List of encodings: https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8

My solution looks like this:

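    // Requires: using System.Linq; using System.Text; the extension method below must be declared in a static class.
    // Encoding Types (int Codes) https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8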
    private static readonly char[] charactersToSkip = new char[] { 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü' };
    private static readonly char[] specialCharsToSkip = new char[] { '^', '´', '`', '°', '!', '\'', '§', '$', '%', '&', '/', '(', ')', '=', '{', '[', ']', '}', '\\', '+', '-' };
    private static readonly char[] ambiguousCharsToSkip = new char[] { '?' };   // Chars which might be a result of encoding-conversion and have to be skipped beforehand.
    private static readonly int[] encodingsToRemoveDiacritics = new int[]
    {
        852,    // 852  ibm852  Central European (DOS)
        850,    // 850  ibm850  Western European (DOS)
        860,    // 860  IBM860  Portuguese (DOS)    

        /* Warning:
         * Only append encodings.
         * Changing sort order of encodings may result in malfunctioning.
         */ 
    };

    public static string RemoveDiacritics(this string inputString)
    {
        if (string.IsNullOrEmpty(inputString))
        {
            return inputString;
        }

        var resultStringBuilder = new StringBuilder();

        foreach (char currentChar in inputString)
        {
            if (charactersToSkip.Contains(currentChar) || specialCharsToSkip.Contains(currentChar) || ambiguousCharsToSkip.Contains(currentChar))
            {
                resultStringBuilder.Append(currentChar);
                continue;
            }

            string normalizedString = currentChar.ToString().Normalize(NormalizationForm.FormD);
            foreach (char normalizedChar in normalizedString)
            {
                if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(normalizedChar) != System.Globalization.UnicodeCategory.NonSpacingMark)
                {
                    string convertedString = normalizedChar.ToString();
                    char[] convertedChars = null;

                    foreach (int encodingCode in encodingsToRemoveDiacritics)
                    {
                        convertedChars = ConvertUnicodeStringToSpecificEncoding(convertedString, encodingCode);

                        if (convertedChars.Contains('?') == false)
                        {
                            convertedString = new string(convertedChars);
                        }
                    }

                    resultStringBuilder.Append(convertedString);
                }
            }
        }

        return resultStringBuilder.ToString();
    }

It produces the following output:

"abcdefghijklmnopqrstuvwxzy" -> "abcdefghijklmnopqrstuvwxzy"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" -> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"1234567890" -> "1234567890"
"ß" -> "ß"
"ÄÖÜ" -> "ÄÖÜ"
"äöü" -> "äöü"
"!\"§$%&/()=?" -> "!\"§$%&/()=?"
"+-_~'*#" -> "+-_~'*#"
",.;:" -> ",.;:"
"µ" -> "u" // My -> u
"<>|" -> "<>|"
"´`^°" -> "´`^°"
"²" -> "2" // ² -> 2
"³" -> "3" // ³ -> 3
"{}" -> "{}"
"[]" -> "[]"
"\\" -> "\\"
"áàâã" -> "aaaa"
"ÁÀÂÅ" -> "AAAA"
"éèêę" -> "eeee"
"ÉÈÊĚ" -> "EEEE"
"íìîï" -> "iiii"
"ÍÌÎ" -> "III"
"óòôõ" -> "oooo"
"ÓÒÔŌ" -> "OOOO"
"úùû" -> "uuu"
"ÚÙÛ" -> "UUU"
"ÇĆĈČĊ" -> "CCCCC"
"çćĉčċ" -> "ccccc"
"Ñ" -> "N"
"Æ" -> "A"
"æ" -> "a"
"ýÿ" -> "yy"
"ĹĻĽ" -> "LLL"
"Ð" -> "D"
"đ" -> "d"
"ð" -> "d"

score -4 · Accepted Answer

This should work:

    private static String RemoveDiacritics(string text)
    {
        String normalized = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < normalized.Length; i++)
        {
            Char c = normalized[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        return sb.ToString();
    }
