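If I just convert all the files to UTF-8 from the command line (using Linux commands), will I lose information?
No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.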
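To see why nothing is lost, here is a minimal sketch of the conversion itself: each UTF-16 code unit (or surrogate pair) maps to exactly one code point, and every code point has a UTF-8 encoding. The helper name `utf16_to_utf8` is hypothetical and not part of the original answer; on the command line, `iconv -f UTF-16 -t UTF-8` performs the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (an illustration, not from the original answer):
// re-encode well-formed UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        // Emit 1 to 4 UTF-8 bytes depending on the size of the code point.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The only inputs this sketch rejects are unpaired surrogates, which are not valid UTF-16 in the first place.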
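When wide characters were introduced, they were intended to be a text representation used exclusively inside a program, and never written to disk as wide characters. The wide streams reflect this: they convert the wide characters you write out into narrow characters in the output file, and convert the narrow characters in a file into wide characters in memory when reading.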
std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).
std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.
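Of course the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.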
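The conversion step a wide file stream performs on output can be sketched by calling the facet directly. This is only an illustration: the helper name `narrow_with_facet` is hypothetical, and a real stream buffer does this in chunks and handles partial conversions rather than throwing.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: narrow a wide string using the codecvt facet of `loc`,
// roughly what a wide file stream does with its imbued locale when writing.
std::string narrow_with_facet(const std::wstring& wide, const std::locale& loc) {
    using Facet = std::codecvt<wchar_t, char, std::mbstate_t>;
    const Facet& facet = std::use_facet<Facet>(loc);

    std::mbstate_t state{};
    // max_length() is the largest number of chars one wchar_t can produce.
    std::string out(wide.size() * static_cast<std::size_t>(facet.max_length()), '\0');
    const wchar_t* from_next = wide.data();
    char* to_next = &out[0];

    const std::codecvt_base::result r = facet.out(
        state, wide.data(), wide.data() + wide.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

For example, `narrow_with_facet(L"Hello", std::locale::classic())` should yield the plain narrow string `"Hello"`, matching the ASCII output file in the snippet above.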
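However, since some people started writing files out in UTF-16, other people have had to deal with it. The way they do that with C++ streams is by creating codecvt facets that treat each char as holding half of a UTF-16 code unit, which is what codecvt_utf16 does.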
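The "half a code unit" idea can be shown without any facet at all. The sketch below pairs raw bytes into 16-bit code units, assuming big-endian order unless a byte order mark says otherwise (the same default codecvt_utf16 uses when asked to consume the header). The helper name `bytes_to_utf16` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: combine pairs of bytes into UTF-16 code units.
// Assumes big-endian unless a BOM indicates little-endian; the BOM is dropped.
std::u16string bytes_to_utf16(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must have an even byte count");

    auto byte_at = [&bytes](std::size_t i) -> unsigned {
        return static_cast<unsigned char>(bytes[i]);
    };

    bool little_endian = false;
    std::size_t start = 0;
    if (bytes.size() >= 2) {
        if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) {
            little_endian = true;  // BOM as typically written on Windows
            start = 2;
        } else if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) {
            start = 2;             // big-endian BOM
        }
    }

    std::u16string units;
    for (std::size_t i = start; i < bytes.size(); i += 2) {
        const unsigned first = byte_at(i);
        const unsigned second = byte_at(i + 1);
        // Each char contributes one half of the 16-bit code unit.
        units.push_back(static_cast<char16_t>(
            little_endian ? (second << 8) | first : (first << 8) | second));
    }
    return units;
}
```

This also suggests why binary mode matters: any newline translation applied by the stream would alter individual bytes and break the pairing.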
So, with that explanation, here are the problems with your code:
std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}
Here's one way to rewrite the above:
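// requires <codecvt>, <cwchar>, <fstream>, <iostream>, <locale>, <string>
// (note that <codecvt> is deprecated since C++17)
// when reading UTF-16 you must use binary mode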
std::wifstream file2(fileFullPath, std::ios::binary);
// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");
// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);
// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));
// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work
std::wstring line;
while (std::getline(file2, line)) {
    std::wcout << line << std::endl;
}
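As for why the `while (!file2.eof())` loop doesn't behave quite correctly: the EOF flag is only set after a read has already failed, so the loop body runs one extra time with an empty line. A small sketch using a string stream in place of the file (both function names are hypothetical):

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Counts loop iterations when eof() is tested before each read.
std::size_t count_with_eof_loop(const std::string& text) {
    std::istringstream in(text);
    std::size_t iterations = 0;
    while (!in.eof()) {
        std::string line;
        std::getline(in, line);  // the final call fails, but the body still runs
        ++iterations;
    }
    return iterations;
}

// Counts loop iterations when the result of getline itself is the condition.
std::size_t count_with_getline_loop(const std::string& text) {
    std::istringstream in(text);
    std::size_t iterations = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++iterations;
    }
    return iterations;
}
```

For the two-line input `"a\nb\n"`, the first function reports three iterations and the second reports two, which is why the rewritten code tests `std::getline` directly.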