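I am trying to convert a Unicode code point to percent-encoded UTF-8 code units.

The Unicode -> UTF-8 conversion seems to work correctly, as shown by tests with some Hindi and Chinese characters, which display correctly in Notepad++ with UTF-8 encoding and can be converted back correctly.

I thought percent-encoding was as simple as putting a '%' in front of each UTF-8 code unit, but that doesn't quite work. Instead of the expected %E5%84%A3, I see %xE5%x84%xA3 (for Unicode U+5123).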

What am I doing wrong?

Code added (note that utf8.h belongs to the UTF8-CPP library).

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "utf8.h"

std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0,0,0,0,0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }

    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }

    ofs.close();

    return 0;
}

1 Answer

You actually need to convert each byte value to the corresponding two-digit hexadecimal ASCII string, for example:

"é" in UTF-8 is the byte sequence { 0xc3, 0xa9 }. Please note that these are bytes, i.e. char values in C++.

Each byte needs to be converted to "%C3" and "%A9" respectively.

The best way to do so is to use std::ostringstream (from <sstream>):

std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";  // UTF-8 bytes of U+5123

// std::size_t matches the unsigned type returned by length()
for (std::size_t i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)utf8str[i];
}

Or in C++11:

for (auto c: utf8str) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)c;
}

Please note that the bytes need to be cast to int, because otherwise the << operator treats them as characters and writes the raw byte value instead of its hexadecimal digits. Casting to unsigned char first is needed because, where char is a signed type, the sign bit would otherwise propagate into the int value, producing output like FFFFFFE5.
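
One thing the loops above do not handle is a byte below 0x10: std::hex alone would print a single digit (for example %A for a line feed), while percent-encoding expects exactly two hex digits. Below is a minimal sketch of a complete helper that pads with std::setw and std::setfill. The name percent_encode_bytes is made up for illustration and is not part of UTF8-CPP; the function encodes every byte it is given, which is an assumption that suits the non-ASCII case discussed here.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not part of UTF8-CPP): writes every byte of the
// input as '%' followed by exactly two uppercase hexadecimal digits.
std::string percent_encode_bytes(const std::string& utf8str)
{
    std::ostringstream out;
    // hex, uppercase and the fill character stay in effect once set;
    // the field width does not, so setw is repeated for every byte.
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char byte : utf8str) {
        // Reading each element as unsigned char avoids sign extension;
        // widening to int makes the stream format a number, not a character.
        out << '%' << std::setw(2) << static_cast<int>(byte);
    }
    return out.str();
}
```

Calling percent_encode_bytes on the string returned by unicode_to_utf8_units(0x5123) from the question should give %E5%84%A3. For real URLs you would probably want to leave unreserved ASCII characters (letters, digits, '-', '.', '_', '~') unencoded, which this sketch does not attempt.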

Answered 2013-10-06T17:51:34.060