
I have some UTF-8 strings in memory (as part of a larger system); they are basically place names from European countries. What I want to do is write them to a text file. I am on my Linux machine (Fedora). So when I write these name strings (char pointers) to the file, the file gets saved in extended-ASCII format.

Now I copy this file over to a Windows machine where I need to load these names into a MySQL DB. When I open the text file in Notepad++, it again defaults to ANSI encoding. But I can switch the encoding to UTF-8, and almost all characters look as expected, except for the following 3: Ő, ő and ű. They appear in the text as the HTML entities &#336;, &#337; and &#369;.

Does anyone have any idea what might be going wrong? I know these are not part of the extended-ASCII symbols. The way I write them to the file is something like:

// create the output file stream
std::ofstream fs("sample.txt");

// loop through the list of UTF-8 formatted strings
if (fs.is_open()) {
    for (int i = 0; i < num_strs; i++) {
        fs << str_names[i]; // unsigned char pointer holding the name in UTF-8
        fs << "\n";
    }
}
fs.close();

Everything looks fine even with characters like ú, ö and ß. The problem is only with the 3 characters above. Any thoughts/suggestions/comments on this? Thanks!

For example, a string like "Gyömrő" shows up as "Gyömr&#337;".


3 Answers


You need to figure out at which stage the unexpected &#336; HTML entities are introduced. My best guess is that they are already present in the strings you are writing to the file. Use a debugger, or add test code that counts the &s in those strings.

That would mean your source of information does not strictly use UTF-8 for non-ASCII characters, but occasionally falls back to HTML entities. That is odd, but possible if your data source is an HTML file (or something similar).

Also, you may want to look at the output file in HEX mode (Notepad++ has a nice plugin for that). That may help you understand what UTF-8 really means at the byte level: the 128 ASCII symbols use one byte with values 0-127. All other symbols use two to four bytes, of which the first is always greater than 127. HTML entities are not really an encoding; they are more like escape sequences, similar to '\n' or '\r'.

answered 2012-09-21T22:30:22.520

If, when opening the file in Notepad++ and choosing UTF-8, your characters aren't showing up properly, then they are not encoded as UTF-8. You also mention "extended ASCII", which has very little to do with Unicode encodings. My belief is that you are in fact writing your characters in some code page, for instance ISO-8859-1.

Try taking a look at the byte count of those troublesome strings inside your program; if the byte count is the same as the character count, then you are in fact not encoding them as UTF-8.

Any character that lies outside the 128-character ASCII table will be encoded with at least two bytes in UTF-8.

To properly handle Unicode within your C++ application, take a look at ICU: http://site.icu-project.org/

answered 2012-09-23T20:24:40.670

The default std::codecvt&lt;char, char, std::mbstate_t&gt; doesn't do you any good: it is defined to do no conversion at all. You'd need to imbue() a std::locale containing a UTF-8 aware code-conversion facet. That said, char can't really represent Unicode values; you'd need a bigger type, although the values you are looking at actually do fit into a char in Unicode, just not in any encoding which allows for all values.

The C++ 2011 standard defines a UTF-8 conversion facet, std::codecvt_utf8&lt;...&gt;. However, it isn't specialized for the internal type char, only for wchar_t, uint16_t, and uint32_t. Using clang together with libc++, I could get the following to do the right thing:

#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::wofstream out("utf8.txt");
    std::locale utf8(std::locale(), new std::codecvt_utf8<wchar_t>());
    out.imbue(utf8);
    out << L"\xd6\xf6\xfc\n";
    out << L"Ööü\n";
}

Note that this code uses wchar_t rather than char. It might look reasonable to use char16_t or char32_t, because these are meant to be UCS2 and UCS4 encoded, respectively (if I understand the standard correctly), but there are no stream types defined for them. Setting up stream types for a new character type is somewhat of a pain.

answered 2012-09-21T22:51:25.617