c++ - How do I get the UTF-8 character number from a binary file in C++?

Question

score 1 · Accepted Answer

The byte sequence you're showing is the UTF-8 encoded version of the character.

You need to decode the UTF-8 to get to the Unicode code point.

For this exact sequence of bytes, the following bits make up the code point:

11100011 10000010 10100010
    ****   ******   ******

So, concatenating the asterisked bits we get the number 0011000010100010, which equals 0x30a2 or 12450 in decimal.
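The same bit concatenation can be written with masks and shifts. This is only a sketch for this one three-byte case: `decode_three_bytes` is a hypothetical helper name, and it assumes the bytes are already known to be a valid lead byte followed by two continuation bytes.

```cpp
#include <cstdint>

// Hypothetical helper: combines the payload bits of one three-byte UTF-8 sequence.
// Assumes b0 is a 1110xxxx lead byte and b1, b2 are 10xxxxxx continuation bytes;
// no validation is performed.
constexpr std::uint32_t decode_three_bytes(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)   // low 4 bits of the lead byte
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)    // low 6 bits of the second byte
         |  static_cast<std::uint32_t>(b2 & 0x3F);         // low 6 bits of the third byte
}

// Compile-time check against the worked example: E3 82 A2 gives 0x30A2 (12450).
static_assert(decode_three_bytes(0xE3, 0x82, 0xA2) == 0x30A2, "E3 82 A2 decodes to U+30A2");
```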

See the Wikipedia description for details on how to interpret the encoding.

In a nutshell: if bit 7 is set in the first byte, the number of adjacent bits below it that are also set (call it m; here m = 2) gives the number of bytes that follow for this code point. The first byte contributes (8 - 1 - 1 - m) bits and each subsequent byte contributes 6 bits. So here we get (8 - 1 - 1 - 2) = 4 bits from the first byte plus 2 * 6 = 12 bits from the two continuation bytes, 16 bits in total.
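As a rough illustration of that rule, the lead byte can be inspected like this. The helper names `continuation_count` and `lead_payload_bits` are made up for this sketch, and it only classifies the lead byte; it does not validate the bytes that follow.

```cpp
#include <cstdint>

// Hypothetical helper: number of continuation bytes announced by a lead byte,
// or -1 if the byte cannot start a sequence.
int continuation_count(std::uint8_t lead) {
    if ((lead & 0x80) == 0) return 0;    // bit 7 clear: single-byte (ASCII) character
    int m = 0;                           // set bits directly below bit 7
    for (int bit = 6; bit >= 0 && (lead & (1u << bit)); --bit) ++m;
    return (m >= 1 && m <= 3) ? m : -1;  // m == 0 is a stray continuation byte
}

// Payload bits carried by a multi-byte lead byte: 8 - 1 - 1 - m (meaningful for m = 1..3).
int lead_payload_bits(int m) { return 8 - 1 - 1 - m; }
```

For the lead byte 0xE3, `continuation_count` gives m = 2, so the code point has `lead_payload_bits(2) + 2 * 6` = 16 payload bits.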

As pointed out in comments, there are plenty of libraries for this, so you might not need to implement it yourself.

score 1 · Accepted Answer

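Working from the Wikipedia page, I came up with this:

#include <stdexcept>

// Exception type used below; any std::exception subclass would do.
struct unicode_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Decodes the single UTF-8 sequence that starts at ptr.
unsigned utf8_to_codepoint(const char* ptr) {
    // Plain char may be signed, so compare bytes as unsigned char.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(ptr);
    if (*p < 0x80) return *p;                                    // 0xxxxxxx: ASCII
    if (*p < 0xC0 || *p >= 0xF8) throw unicode_error("invalid utf8 lead byte");
    unsigned result = 0;
    int shift = 0;                                               // continuation bytes left to read
    if (*p < 0xE0)      { result = *p & 0x1F; shift = 1; }       // 110xxxxx
    else if (*p < 0xF0) { result = *p & 0x0F; shift = 2; }       // 1110xxxx
    else                { result = *p & 0x07; shift = 3; }       // 11110xxx
    for (; shift > 0; --shift) {
        ++p;
        if (*p < 0x80 || *p >= 0xC0)                             // must be 10xxxxxx
            throw unicode_error("invalid utf8 continuation byte");
        result <<= 6;
        result |= *p & 0x3F;                                     // low six payload bits
    }
    return result;
}

Note that this is still a very poor implementation: it accepts a lot of invalid input that it probably shouldn't (overlong encodings, surrogate code points, values above U+10FFFF), and it has no way of knowing where the buffer ends. I put this up merely to show that it's a lot harder than you'd think, and that you should use a good Unicode library.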
