c# - How to decode surrogate characters encoded as UTF8?

Question

My C# program gets some UTF-8 encoded data and decodes it using Encoding.UTF8.GetString(data). When the program that produces the data gets characters outside the BMP, it encodes them as 2 surrogate characters, each encoded as UTF-8 separately. In such cases, my program can't decode them properly.

How can I decode such data in C#?

Example:

static void Main(string[] args)
{
    string orig = "\U0001F30E"; // U+1F30E, a character outside the BMP
    byte[] correctUTF8 = Encoding.UTF8.GetBytes(orig); // Simulate correct conversion using std::codecvt_utf8_utf16<wchar_t>
    Console.WriteLine("correctUTF8: " + BitConverter.ToString(correctUTF8));  // F0-9F-8C-8E - that's what the C++ program should've produced

    // Simulate bad conversion using std::codecvt_utf8<wchar_t> - that's what I get from the program
    byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };
    string badString = Encoding.UTF8.GetString(badUTF8); // ���� (4 * U+FFFD 'REPLACEMENT CHARACTER')
    // How can I convert this?
}

Note: The encoding program is written in C++, and converts the data using std::codecvt_utf8<wchar_t> (code below). As @PeterDuniho's answer correctly notes, it should've used std::codecvt_utf8_utf16<wchar_t>. Unfortunately, I don't control this program, and can't change its behavior - only handle its malformed input.

std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

score 3 · Accepted Answer

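It's impossible to know for sure without a good Minimal, Complete, and Verifiable code example. But it looks to me as though you are using the wrong converter in C++.

The std::codecvt_utf8<wchar_t> facet converts from UCS-2, not UTF-16. The two are very similar, but UCS-2 doesn't support the surrogate pairs that are required to encode the character you want to encode.

Instead, you should be using std::codecvt_utf8_utf16<wchar_t>: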

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

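When I use that converter, I get the required UTF-8 bytes: F0 9F 8C 8E. These, of course, decode correctly in .NET when interpreted as UTF-8.

Addendum:

The question has been updated to indicate that the encoding code can't be changed. You are stuck with UCS-2 that has been encoded into invalid UTF8. Because the UTF8 is invalid, you'll have to decode the text yourself.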
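To make the failure mode concrete, here is a minimal sketch (not part of the original answer) of how the six bytes in the question arise: the code point is split into a UTF-16 surrogate pair, and each surrogate is then encoded as its own three-byte sequence. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;

static class SurrogateBytesDemo
{
    // Encodes one 16-bit code unit in the range 0x0800..0xFFFF as three
    // UTF-8-style bytes without checking whether it is a surrogate.
    // This mimics what the faulty encoder does; it is not valid UTF-8 output.
    public static byte[] EncodeUnitAsThreeBytes(char unit)
    {
        return new byte[]
        {
            (byte)(0xE0 | (unit >> 12)),
            (byte)(0x80 | ((unit >> 6) & 0x3F)),
            (byte)(0x80 | (unit & 0x3F)),
        };
    }

    // Returns the hex dump of the malformed encoding of a supplementary code point.
    public static string Describe(int codePoint)
    {
        if (codePoint < 0x10000)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint), "expected a code point above U+FFFF");
        }

        string utf16 = char.ConvertFromUtf32(codePoint); // two chars: high and low surrogate
        byte[] high = EncodeUnitAsThreeBytes(utf16[0]);
        byte[] low = EncodeUnitAsThreeBytes(utf16[1]);
        return BitConverter.ToString(high) + "-" + BitConverter.ToString(low);
    }
}
```

Calling SurrogateBytesDemo.Describe(0x1F30E) yields "ED-A0-BC-ED-BC-8E", which matches the badUTF8 array in the question.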

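I see a couple of reasonable ways to do this. First, write a decoder that doesn't care if the UTF8 includes invalid byte sequences. Second, use the C++ std::wstring_convert<std::codecvt_utf8<wchar_t>> converter to decode the bytes for you (e.g. write your receiving code in C++, or write a C++ DLL you can call from your C# code to do the work).

The second option is in some sense the more reliable, i.e. you're using exactly the decoder that matches the encoder that created the bad data in the first place. On the other hand, it might be overkill even to create a DLL, never mind write the entire client in C++. Making a DLL, even using C++/CLI, you'd still have some headaches getting the interop to work right, unless you're already an expert.

I'm familiar with C++/CLI, but hardly an expert. I'm much better with C#, so here's some code for the first option: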

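// Requires: using System; using System.Collections.Generic;
// Offset that maps (codePoint >> 10) directly onto the high-surrogate range.
private const int _khighOffset = 0xD800 - (0x10000 >> 10);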

/// <summary>
/// Decodes a nominally UTF8 byte sequence as UTF16. Ignores all data errors
/// except those which prevent coherent interpretation of the input data.
/// Input with invalid-but-decodable UTF8 sequences will be decoded without
/// error, and may lead to invalid UTF16.
/// </summary>
/// <param name="bytes">The UTF8 byte sequence to decode</param>
/// <returns>A string value representing the decoded UTF8</returns>
/// <remarks>
/// This method has not been thoroughly validated. It should be tested
/// carefully with a broad range of inputs (the entire UTF16 code point
/// range would not be unreasonable) before being used in any sort of
/// production environment.
/// </remarks>
private static string DecodeUtf8WithOverlong(byte[] bytes)
{
    List<char> result = new List<char>();
    int continuationCount = 0, continuationAccumulator = 0, highBase = 0;
    char continuationBase = '\0';

    for (int i = 0; i < bytes.Length; i++)
    {
        byte b = bytes[i];

        if (b < 0x80)
        {
            result.Add((char)b);
            continue;
        }

        if (b < 0xC0)
        {
            // Byte values in this range are used only as continuation bytes.
            // If we aren't expecting any continuation bytes, then the input
            // is invalid beyond repair.
            if (continuationCount == 0)
            {
                throw new ArgumentException("invalid encoding");
            }

            // Each continuation byte represents 6 bits of the actual
            // character value
            continuationAccumulator <<= 6;
            continuationAccumulator |= (b - 0x80);
            if (--continuationCount == 0)
            {
                continuationAccumulator += highBase;

                if (continuationAccumulator > 0xffff)
                {
                    // Code point requires more than 16 bits, so split into surrogate pair
                    char highSurrogate = (char)(_khighOffset + (continuationAccumulator >> 10)),
                        lowSurrogate = (char)(0xDC00 + (continuationAccumulator & 0x3FF));

                    result.Add(highSurrogate);
                    result.Add(lowSurrogate);
                }
                else
                {
                    result.Add((char)(continuationBase | continuationAccumulator));
                }
                continuationAccumulator = 0;
                continuationBase = '\0';
                highBase = 0;
            }
            continue;
        }

        if (b < 0xE0)
        {
            continuationCount = 1;
            continuationBase = (char)((b - 0xC0) * 0x0040);
            continue;
        }

        if (b < 0xF0)
        {
            continuationCount = 2;
            continuationBase = (char)(b == 0xE0 ? 0x0800 : (b - 0xE0) * 0x1000);
            continue;
        }

        if (b < 0xF8)
        {
            continuationCount = 3;
            highBase = (b - 0xF0) * 0x00040000;
            continue;
        }

        if (b < 0xFC)
        {
            continuationCount = 4;
            highBase = (b - 0xF8) * 0x01000000;
            continue;
        }

        if (b < 0xFE)
        {
            continuationCount = 5;
            highBase = (b - 0xFC) * 0x40000000;
            continue;
        }

        // byte values of 0xFE and 0xFF are invalid
        throw new ArgumentException("invalid encoding");
    }

    return new string(result.ToArray());
}
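The surrogate split in the branch for code points above 0xFFFF relies on folding the usual "subtract 0x10000" step into the _khighOffset constant. As a small sketch (with hypothetical names, not taken from the answer), the same arithmetic can be isolated and compared against the framework's own char.ConvertFromUtf32:

```csharp
using System;

static class SurrogateSplit
{
    // Same constant as _khighOffset in the decoder: 0xD800 - 0x40 = 0xD7C0.
    private const int HighOffset = 0xD800 - (0x10000 >> 10);

    // Splits a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
    public static string Split(int codePoint)
    {
        // 0x10000 is a multiple of 0x400, so subtracting it changes only the
        // upper bits; the low 10 bits can be taken from the code point directly.
        char high = (char)(HighOffset + (codePoint >> 10));
        char low = (char)(0xDC00 + (codePoint & 0x3FF));
        return new string(new[] { high, low });
    }
}
```

For U+1F30E this gives the pair D83C DF0E, the same two code units that char.ConvertFromUtf32(0x1F30E) returns.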

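I tested the DecodeUtf8WithOverlong method with your globe character, and it works fine for that. It also correctly decodes the proper UTF8 for that character (i.e. F0 9F 8C 8E). If you intend to use this code to decode all of your UTF8 input, you'll of course want to test it with a full range of data.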
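A narrower alternative, offered here only as a sketch and not as the answerer's method, is to repair the input before handing it to the standard decoder: find each pair of three-byte sequences that encode a high surrogate followed by a low surrogate, rewrite the pair as the proper four-byte UTF-8 sequence, and then call Encoding.UTF8.GetString. The names SurrogateRepair, ReadEdSequence, and Decode are hypothetical. Unpaired surrogates are left untouched, so the standard decoder will still replace them with U+FFFD.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SurrogateRepair
{
    // Reads a three-byte sequence starting with 0xED at index i and returns the
    // 16-bit value it encodes, or -1 if the bytes at i do not have that shape.
    private static int ReadEdSequence(byte[] bytes, int i)
    {
        if (i + 2 >= bytes.Length || bytes[i] != 0xED) return -1;
        if ((bytes[i + 1] & 0xC0) != 0x80 || (bytes[i + 2] & 0xC0) != 0x80) return -1;
        return 0xD000 | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F);
    }

    public static string Decode(byte[] bytes)
    {
        var repaired = new List<byte>(bytes.Length);
        int i = 0;
        while (i < bytes.Length)
        {
            int high = ReadEdSequence(bytes, i);
            int low = ReadEdSequence(bytes, i + 3);
            if (high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF)
            {
                // Combine the pair into one code point and emit its four-byte UTF-8 form.
                int cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
                repaired.Add((byte)(0xF0 | (cp >> 18)));
                repaired.Add((byte)(0x80 | ((cp >> 12) & 0x3F)));
                repaired.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
                repaired.Add((byte)(0x80 | (cp & 0x3F)));
                i += 6;
            }
            else
            {
                // Anything else is copied through unchanged.
                repaired.Add(bytes[i]);
                i++;
            }
        }
        return Encoding.UTF8.GetString(repaired.ToArray());
    }
}
```

With the question's input, SurrogateRepair.Decode(new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E }) rewrites the bytes to F0 9F 8C 8E before decoding, and correctly encoded input passes through unchanged. Because it keeps the framework's validation for everything else, it can also serve as a cross-check when testing a hand-written decoder over a wide range of inputs.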
