c++ - How can I get individual characters from a Unicode string and compare and print them?

Question

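I am using libunistring to handle Unicode strings in C. Using a different library is not an option. My goal is to read a single character from a Unicode string at a given index, print it, and compare it to a fixed value. This should be simple, but...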

Here is my attempt (a complete C program):

/* This file must be UTF-8 encoded in order to work */

#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <unitypes.h>
#include <uniconv.h>
#include <unistdio.h>
#include <unistr.h>
#include <uniwidth.h>


int cmpchr(const char *label, const uint32_t charExpected, const uint32_t charActual) {
    int result = u32_cmp(&charExpected, &charActual, 1);
    if (result == 0) {
        printf("%s is recognized as '%lc', good!\n", label, charExpected);
    } else {
        printf("%s is NOT recognized as '%lc'.\n", label, charExpected);
    }
    return result;
}


int main() {
    setlocale(LC_ALL, "");     /* switch from default "C" encoding to system encoding */
    const char *enc = locale_charset();
    printf("Current locale charset: %s (should be UTF-8)\n\n", enc);

    const char *buf = "foo 楽あり bébé";
    const uint32_t *mbcs = u32_strconv_from_locale(buf);

    printf("%s\n", u32_strconv_to_locale(mbcs));

    uint32_t c0 = mbcs[0];
    uint32_t c5 = mbcs[5];
    uint32_t cLast = mbcs[u32_strlen(mbcs) - 1];

    printf(" - char 0: %lc\n", c0);
    printf(" - char 5: %lc\n", c5);
    printf(" - last  : %lc\n", cLast);

    /* When this file is UTF-8-encoded, I'm passing a UTF-8 character
     * as a uint32_t, which should be wrong! */
    cmpchr("Char 0", 'f', c0);
    cmpchr("Char 5", 'あ', c5);
    cmpchr("Last char", 'é', cLast);

    return 0;
}

To run this program:

Save the program to a UTF-8 encoded file named ustridx.c
sudo apt-get install libunistring-dev
gcc -o ustridx.o -W -Wall -O -c ustridx.c ; gcc -o ustridx -lunistring ustridx.o
Make sure the terminal is set to a UTF-8 locale (check with locale)
Run it with ./ustridx

Output:

Current locale charset: UTF-8 (should be UTF-8)

foo 楽あり bébé
 - char 0: f
 - char 5: あ
 - last  : é
Char 0 is recognized as 'f', good!
Char 5 is NOT recognized as '�����'.
Last char is NOT recognized as '쎩'.

The desired behavior is for char 5 and the last char to be recognized correctly and printed correctly in the last two lines of the output.

score 1 · Accepted Answer

From libunistring's documentation (the description of u32_cmp):

 Compares S1 and S2, each of length N, lexicographically.  Returns a
 negative value if S1 compares smaller than S2, a positive value if
 S1 compares larger than S2, or 0 if they compare equal.

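The comparison in the if statement is wrong. That is the reason for the mismatch. Of course, this reveals other, unrelated issues that also need to be fixed. But that is the reason for the puzzling comparison results.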
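To make the quoted contract concrete, here is a minimal sketch of a comparison that follows the same rules: zero means equal, and the sign says which side is smaller. The names lex_cmp_u32 and same_code_point are hypothetical stand-ins written for illustration; this is not libunistring's implementation of u32_cmp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that follows the documented contract:
 * negative if s1 < s2, positive if s1 > s2, 0 if all n units are equal.
 * Not libunistring's actual code. */
int lex_cmp_u32(const uint32_t *s1, const uint32_t *s2, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (s1[i] != s2[i]) {
            return s1[i] < s2[i] ? -1 : 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 when both code points are equal.
 * Note that "equal" corresponds to a comparison result of 0. */
int same_code_point(uint32_t expected, uint32_t actual) {
    return lex_cmp_u32(&expected, &actual, 1) == 0;
}
```

For a single code point, a plain expected == actual test would give the same answer as a length-1 lexicographic comparison.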

score 1 · Accepted Answer

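'あ' and 'é' are not valid character literals. Only characters from the basic source character set and escape sequences are allowed in character literals.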

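GCC does, however, issue a warning (see Godbolt): warning: multi-character character constant. That warning is about a different case, namely character constants such as 'abc', which are multi-character literals. It applies here because these characters are encoded as multiple bytes in UTF-8. According to cppreference, the value of such a literal is implementation-defined, so you cannot rely on it being the corresponding Unicode code point. GCC specifically does not do that, as shown here.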
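The following sketch shows why the value is not the code point. It assumes the behavior GCC documents for multi-character constants: each byte is OR-ed in after shifting the previous value left by 8 bits. The helper pack_bytes is a hypothetical name used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: packs the bytes of a string into one integer,
 * shifting left by 8 bits per byte. This mimics the value GCC is
 * documented to give a multi-character constant; other compilers may
 * differ because the value is implementation-defined. */
uint32_t pack_bytes(const char *bytes) {
    uint32_t value = 0;
    for (size_t i = 0; bytes[i] != '\0'; i++) {
        value = (value << 8) | (unsigned char)bytes[i];
    }
    return value;
}
```

With the UTF-8 bytes of é ("\xC3\xA9") this yields 0xC3A9 rather than the code point 0xE9, which is consistent with the '쎩' (U+C3A9) printed in the question's output. With the bytes of あ ("\xE3\x81\x82") it yields 0xE38182, which lies above the Unicode maximum of 0x10FFFF, so it cannot be printed as a character at all.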

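Since C11 you can use UTF-32 character literals such as U'あ', which yield a char32_t holding the character's Unicode code point. Although, by my reading, the standard does not allow characters such as あ in a literal, the examples on cppreference seem to suggest that compilers commonly accept it.
The standard-conforming, portable solution is to use a Unicode escape sequence in the character literal, such as U'\u3042' for あ, but that is hardly different from using an integer constant such as 0x3042.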
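As a minimal sketch of the portable form, assuming a C11 (or later) compiler, the comparisons from the question could be written with escaped UTF-32 literals. The helper names is_hiragana_a and is_e_acute are hypothetical; uint32_t is used for the parameter because the question's code reads its characters into uint32_t variables.

```c
#include <stdint.h>

/* Hypothetical helper: true when the code point is U+3042 (あ).
 * U'\u3042' is a C11 UTF-32 character literal with value 0x3042. */
int is_hiragana_a(uint32_t code_point) {
    return code_point == U'\u3042';
}

/* Hypothetical helper: true when the code point is U+00E9 (é). */
int is_e_acute(uint32_t code_point) {
    return code_point == U'\u00E9';
}
```

In the question's program, this would correspond to passing U'\u3042' and U'\u00E9' (or simply 0x3042 and 0xE9) as the expected values instead of 'あ' and 'é'.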
