c++ - Reading/storing strings of different types (utf8/utf16/ansi)

Question

I am parsing a file that contains various strings in different encodings. The strings are stored like this:

0xFF 0xFF - block header                   2 bytes
0xXX 0xXX - length in bytes                2 bytes
0xXX      - encoding (can be 0, 1, 2, 3)   1 byte
...       - actual string                  num bytes per length

Normally this would be easy, but I'm not sure how to handle the encodings. The encoding can be one of the following:

0x00 - regular ascii string (that is, actual bytes represent char*)
0x01 - utf-16 with BOM (wchar_t* with the first two bytes being 0xFF 0xFE or 0xFE 0xFF)
0x02 - utf-16 without BOM (wchar_t* directly)
0x03 - utf-8 encoded string (char* to utf-8 strings)

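I need to read/store this somehow. Initially I was thinking of a plain string, but that would not work with wchar_t*. Then I considered converting everything to wstring, but that would mean quite a lot of unnecessary conversion. The next thing that came to mind was boost::variant<string, wstring> (I am already using boost::variant elsewhere in the code). That seems like a reasonable choice to me. So now I'm a bit stuck on parsing it. I'm thinking of something along these lines: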

//after reading the bytes, I have these:
int length;
char encoding;
char* bytes;

boost::variant<string, wstring> value;
switch(encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
        value = wstring(??);
        //how do I use BOM in creating the wstring?
        break;
    case 0x02:
        value = wstring(bytes, length >> 1);
        break;
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}

Since I only print these strings later, I can store UTF8 in a plain string without much hassle.

My two questions are:

Is this approach reasonable (i.e. using boost::variant)?
How do I create a wstring with a specific BOM?

score 0 · Accepted Answer

UTF16 requires you to distinguish between LE and BE.

I suspect that 0x02 - utf-16 without BOM (wchar_t* directly) is actually UTF16 BE. For the "with BOM" encoding, LE/BE is indicated by the BOM.
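To make the LE/BE point concrete, here is a minimal sketch of reading UTF-16 code units from raw bytes. The names read_utf16 and ByteOrder are hypothetical, not from any library. It assumes a leading FF FE means little-endian and FE FF means big-endian, and that the caller supplies the byte order to use when no BOM is present (the file format in the question does not state it).

```cpp
#include <cstddef>
#include <string>

// Byte order of a UTF-16 payload.
enum class ByteOrder { Little, Big };

// Hypothetical helper: turns raw bytes into UTF-16 code units.
// If the payload starts with a BOM (FF FE or FE FF), the BOM decides the
// byte order and is dropped. Otherwise `fallback` is used; that value is an
// assumption the caller must supply, since the format does not document it.
// Caveat: a BOM-less payload whose first character is U+FEFF would be
// mistaken for a BOM here. A trailing odd byte is ignored.
std::u16string read_utf16(const unsigned char* bytes, std::size_t length,
                          ByteOrder fallback) {
    ByteOrder order = fallback;
    std::size_t pos = 0;
    if (length >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            order = ByteOrder::Little;
            pos = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            order = ByteOrder::Big;
            pos = 2;
        }
    }

    std::u16string units;
    units.reserve((length - pos) / 2);
    for (; pos + 1 < length; pos += 2) {
        // Assemble each 16-bit unit explicitly instead of casting the
        // buffer, so the result does not depend on the host's byte order.
        unsigned first = bytes[pos];
        unsigned second = bytes[pos + 1];
        unsigned unit = (order == ByteOrder::Little)
                            ? (first | (second << 8))
                            : (second | (first << 8));
        units.push_back(static_cast<char16_t>(unit));
    }
    return units;
}
```

Assembling the units byte by byte avoids both the alignment problem and the wchar_t size problem that come with casting a char* buffer.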

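The C++ standard library's Unicode support is very limited, and I don't think vanilla C++ handles UTF16 LE/BE properly, let alone UTF8. Many Unicode applications use a third-party support library such as ICU.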

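For the in-memory representation I would stick with std::string, because std::string can hold text in any encoding, whereas std::wstring does not help much in this multi-encoding situation. If you do need std::wstring and the related std::iostream functions, pay attention to the system locale and the std::locale settings.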

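Mac OS X uses UTF8 as its sole default text encoding, while Windows uses UTF16 LE. I think you likewise need only one text encoding internally, plus a few conversion functions, to achieve what you want.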
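As an illustration of "one internal encoding plus a few conversion functions", the following is a rough sketch of a UTF-16 to UTF-8 converter that uses only the standard library. The function names are made up for this example, and it assumes the input is already a sequence of code units in the right byte order. Replacing unpaired surrogates with U+FFFD is one possible policy, not the only one; a library such as ICU covers many more cases.

```cpp
#include <cstddef>
#include <string>

// Appends one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts UTF-16 code units to a UTF-8 std::string.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: substitute U+FFFD (a policy choice).
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With a function like this, every string from the file can end up in a std::string holding UTF-8, and the ASCII and UTF-8 cases need no conversion at all.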

score 0 · Accepted Answer

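After some research and trial and error, I decided to go with UTF8-CPP, a lightweight, header-only set of functions for converting to and from utf8. It includes functions for converting utf-16 to utf-8 and, as far as I understand, it handles the BOM correctly.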

I then store all strings as std::string, converting the utf-16 strings to utf-8, like this (based on my example above):

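int length; char encoding; char* bytes;

string value;
switch (encoding) {
    case 0x00:
    case 0x03:
        value = string(bytes, length);
        break;
    case 0x01:
    case 0x02: {
        // The braces give the locals their own scope; without them the jump
        // to `default:` would bypass their initialization and not compile.
        // Note: the cast assumes a 16-bit wchar_t (true on Windows, not on
        // most Unix systems) and input in the host's byte order.
        vector<unsigned char> utf8;
        wchar_t* input = (wchar_t*)bytes;
        utf16to8(input, input + (length >> 1), back_inserter(utf8));
        value = string(utf8.begin(), utf8.end());
        break;
    }
    default:
        throw ERROR_INVALID_STRING_ENCODING;
}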

This worked well in my quick tests. I need to do more testing before reaching a final verdict.
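For completeness, here is a hedged sketch of reading one block in the layout described in the question (2-byte 0xFF 0xFF header, 2-byte length, 1-byte encoding, then the string bytes). StringBlock and read_block are hypothetical names. The question does not say which byte order the length field uses, so little-endian is assumed here purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One parsed record; the payload is kept as raw bytes and decoded later
// according to `encoding`.
struct StringBlock {
    std::uint8_t encoding;               // 0 = ascii, 1/2 = utf-16, 3 = utf-8
    std::vector<unsigned char> payload;  // raw string bytes
};

// Hypothetical helper: parses the block starting at data[offset] and
// advances `offset` past it. Throws std::runtime_error on malformed input.
StringBlock read_block(const std::vector<unsigned char>& data,
                       std::size_t& offset) {
    const std::size_t header_size = 5;  // 2 header + 2 length + 1 encoding
    if (offset > data.size() || data.size() - offset < header_size) {
        throw std::runtime_error("truncated block header");
    }
    if (data[offset] != 0xFF || data[offset + 1] != 0xFF) {
        throw std::runtime_error("bad block header");
    }
    // ASSUMPTION: the length field is little-endian; the source does not say.
    const std::size_t length = data[offset + 2] | (data[offset + 3] << 8);
    const std::uint8_t encoding = data[offset + 4];
    if (encoding > 3) {
        throw std::runtime_error("invalid string encoding");
    }
    if (data.size() - offset - header_size < length) {
        throw std::runtime_error("truncated string payload");
    }

    const auto first =
        data.begin() + static_cast<std::ptrdiff_t>(offset + header_size);
    StringBlock block;
    block.encoding = encoding;
    block.payload.assign(first, first + static_cast<std::ptrdiff_t>(length));
    offset += header_size + length;
    return block;
}
```

Keeping the payload as raw bytes at this stage separates the framing from the decoding, so the encoding switch only has to deal with a buffer whose bounds have already been checked.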
