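Using C#, how can I remove utf8mb4 characters (emoji, etc.) from a string so that the result is fully utf8-compliant?

Most solutions involve changing the database configuration, but unfortunately I don't have that option.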

1 Answer

This should replace surrogate characters with a replacementCharacter (which can even be string.Empty).

This is a MySQL problem, given the utf8mb4. In MySQL, the difference between utf8 and utf8mb4 is that utf8 doesn't support 4-byte UTF-8 sequences. According to Wikipedia, 4-byte UTF-8 sequences encode the code points above 0xFFFF, which in UTF-16 (the encoding of a C# string) require two char values, called a surrogate pair. This method removes surrogate characters. When a proper pair is found (a high surrogate followed by a low surrogate), the whole pair is replaced by a single replacementCharacter; an orphaned (invalid) high or low surrogate is also replaced by a replacementCharacter.

// Requires: using System.Text; (for StringBuilder)
public static string RemoveSurrogatePairs(string str, string replacementCharacter = "?")
{
    if (str == null)
    {
        return null;
    }

    // Created lazily: stays null (and str is returned as-is) if no surrogate is found
    StringBuilder sb = null;

    for (int i = 0; i < str.Length; i++)
    {
        char ch = str[i];

        if (char.IsSurrogate(ch))
        {
            if (sb == null)
            {
                sb = new StringBuilder(str, 0, i, str.Length);
            }

            sb.Append(replacementCharacter);

            // If there is a high+low surrogate, skip the low surrogate
            if (i + 1 < str.Length && char.IsHighSurrogate(ch) && char.IsLowSurrogate(str[i + 1]))
            {
                i++;
            }
        }
        else if (sb != null)
        {
            sb.Append(ch);
        }
    }

    return sb == null ? str : sb.ToString();
}
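
To make the relationship between surrogates and 4-byte sequences concrete, here is a minimal sketch of a check that could be run on a string before or after cleaning it. The class name Utf8mb3Check and the helper FitsInThreeByteUtf8 are hypothetical names chosen for illustration; only standard .NET calls (char.IsSurrogate, Encoding.UTF8.GetByteCount) are used.

```csharp
using System;
using System.Text;

public static class Utf8mb3Check
{
    // Hypothetical helper (not part of the answer above): returns true when the
    // string contains no surrogate chars, i.e. no code point above 0xFFFF and
    // therefore nothing that needs a 4-byte UTF-8 sequence.
    public static bool FitsInThreeByteUtf8(string str)
    {
        if (str == null)
        {
            return true;
        }

        foreach (char ch in str)
        {
            if (char.IsSurrogate(ch))
            {
                return false;
            }
        }

        return true;
    }

    public static void Main()
    {
        string emoji = "\U0001F600"; // U+1F600, outside the Basic Multilingual Plane

        Console.WriteLine(emoji.Length);                          // 2 (a surrogate pair)
        Console.WriteLine(Encoding.UTF8.GetByteCount(emoji));     // 4
        Console.WriteLine(FitsInThreeByteUtf8("abc" + emoji));    // False
        Console.WriteLine(FitsInThreeByteUtf8("abc\u20AC"));      // True (the euro sign needs 3 bytes)
    }
}
```

A string returned by RemoveSurrogatePairs should pass this check, provided the replacementCharacter itself contains no surrogates.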
Answered 2015-05-22T10:03:25.003