c - Finding non-ASCII characters in a UTF-8 string

Question

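I need to find the non-ASCII characters in a UTF-8 string.

My understanding: UTF-8 is a superset of ASCII, in which the values 0-127 are the ASCII characters. So if a character value in a UTF-8 string is not in the range 0-127, it is not an ASCII character, right? Please correct me if I am wrong here.

Based on that understanding, I wrote the following code in C:

Note: I am using the gcc compiler on Ubuntu to run the C code.

The UTF-8 string is x√ab c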

long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
    char ch = arr[i];
    if (isascii(ch))
        printf("Ascii character %c\n", ch);
    else
        printf("Not ascii character %c\n", ch);
}

The printed output is as follows:

length : 9 
Ascii character x
Not ascii character 
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character  
Ascii character c
Ascii character

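To the eye, the length of x√ab c is 6, but in the code it is 9? The correct answer for x√ab c is 1, i.e. it contains only 1 non-ASCII character, but the output above reports 3 (the single non-ASCII character is counted several times).

How do I correctly find the non-ASCII characters in a UTF-8 string?

Please guide me on this topic.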

score 7 · Accepted Answer

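What C calls a char is actually a byte. A UTF-8 character can consist of several bytes.

In fact, in UTF-8 only ASCII characters are represented by a single byte (which is why all valid ASCII-encoded text is also valid UTF-8-encoded text).

So to count the number of UTF-8 characters, you have to do a partial decode: count the bytes that begin a character, that is, the single-byte codes and the start bytes.

See the Wikipedia article on UTF-8 for how they are encoded.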

Basically there are 3 classes of bytes:

Single-byte codes: 0b0xxxxxxx
Start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
Continuation bytes: 0b10xxxxxx

To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
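The counting rule above can be sketched in C as follows. This is a minimal sketch that assumes the input is valid, NUL-terminated UTF-8; the function names `is_continuation`, `count_code_points` and `count_non_ascii` are made up for illustration and do not come from any library. The string literal is written with hex escapes and split in two so that the `a` and `b` after `\x9A` are not read as part of the escape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero for continuation bytes (10xxxxxx). */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Counts code points by counting every byte that is not a continuation
   byte. Assumes s is valid, NUL-terminated UTF-8. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (!is_continuation((unsigned char)*s))
            n++;
    }
    return n;
}

/* Counts non-ASCII code points by counting start bytes (11xxxxxx),
   so a multi-byte character is counted once, not once per byte. */
size_t count_non_ascii(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) == 0xC0)
            n++;
    }
    return n;
}
```

With the string from the question, `count_code_points("x\xE2\x88\x9A" "ab c")` gives 6 and `count_non_ascii` gives 1, which matches what the asker expected. Casting to `unsigned char` before masking avoids surprises on platforms where plain `char` is signed.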

However, Unicode code points do not always correspond one-to-one with "characters" (depending on your exact definition of a character).
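One way to see the gap between code points and "characters" is a combining mark. The sketch below assumes valid UTF-8 input, and the names `code_points`, `precomposed` and `decomposed` are hypothetical: both strings display as "é", yet they contain different numbers of code points.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points by skipping continuation bytes (10xxxxxx).
   Assumes s is valid, NUL-terminated UTF-8. */
size_t code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++)
        n += ((unsigned char)*s & 0xC0) != 0x80;
    return n;
}

/* "é" as a single precomposed code point, U+00E9 (2 bytes). */
const char precomposed[] = "\xC3\xA9";

/* "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes). */
const char decomposed[] = "e\xCC\x81";
```

Here `code_points(precomposed)` is 1 while `code_points(decomposed)` is 2, even though a reader sees one character in both cases. Counting what a user perceives as characters needs grapheme cluster segmentation, which is beyond byte-level inspection.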

score 3 · Accepted Answer

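UTF-8 characters are stored in a char array in such a way that the first byte of each UTF-8 character carries the information about how many bytes are used to represent that character. For a non-ASCII character, the number of consecutive 1s starting from the MSB of the first byte is the total number of bytes the character occupies. In the case of "√", the binary form is 11100010, 10001000, 10011010. Counting the leading 1s in the first byte gives 3 bytes. Code similar to the following can solve this: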

int get_count(char non_ascii_char){
        /* 
           Returns the number of bytes occupied by a UTF-8 character.
           Takes the first (lead) byte of a non-ASCII character as input and
           returns the length to the calling function. An ASCII byte yields 0
           and a continuation byte (10xxxxxx) yields 1.
        */
        unsigned char byte=(unsigned char)non_ascii_char;// avoids right-shifting a negative value
        int bit_counter=7,count=0;
        /*
           bit_counter - index of the bit being examined, from the MSB (7) down to 0
           count - number of consecutive leading 1s, i.e. the bytes occupied by the character
        */

        for(;bit_counter>=0;bit_counter--){
            if((byte>>bit_counter)&1){
                count++;// one more consecutive leading 1
            }
            else{
                break;// stop at the first 0
            }
        }

        return count;// returns the count to the calling function
}
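To connect the byte count back to the original question, here is a hedged sketch of walking a string one character at a time and reporting each non-ASCII character once. It uses mask tests instead of the bit loop above, and the names `utf8_seq_len` and `print_non_ascii` are illustrative, not part of any standard API. It assumes NUL-terminated input and simply stops at the first malformed sequence rather than attempting any recovery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Length in bytes of the sequence that starts with byte b, or 0 if b
   cannot start a sequence (continuation byte or invalid value). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Prints each non-ASCII character of s once and returns how many were
   found. Stops early if the input is not well-formed. */
size_t print_non_ascii(const char *s)
{
    size_t found = 0;
    assert(s != NULL);
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        size_t i;
        if (len == 0)
            return found;  /* stray continuation byte or invalid byte */
        /* Every following byte must be a continuation byte; this check
           also keeps us from stepping past the terminating NUL. */
        for (i = 1; i < len; i++) {
            if (((unsigned char)s[i] & 0xC0) != 0x80)
                return found;
        }
        if (len > 1) {
            /* %.*s prints exactly len bytes, i.e. the whole character */
            printf("Not ascii character %.*s\n", (int)len, s);
            found++;
        }
        s += len;
    }
    return found;
}
```

Called as `print_non_ascii("x\xE2\x88\x9A" "ab c")`, this reports the "√" a single time and returns 1, instead of flagging each of its three bytes separately as the code in the question does.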
