c - GCC and Clang do not recognize Unicode string

Question

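I am passing a UTF-32 string literal to GCC, and it complains about an invalid multibyte or wide character.

I tested this in Clang as well and got the same error message.

I originally wrote the statement using MSVC, where it worked fine.

Here is the assert statement.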

 assert(utf_string_copy_utf32(&string, U"¿Cómo estás?") == 0);

Here is the declaration.

int utf_string_copy(struct utf_string * a, const char32_t * b);

Here is the compile command:

cc -Wall -Wextra -Werror -Wfatal-errors -g -I ../include -fexec-charset=UTF-32 string-test.c libutf.a -o string-test

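Am I right to assume that GCC can only recognize Unicode characters through escape sequences?

Or am I misunderstanding how GCC and Clang recognize these characters?

Edit 1

Here is the error message.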

string-test.c: In function ‘test_copy’:
string-test.c:46:61: error: converting to execution character set: Invalid or incomplete multibyte or wide character
assert(utf_string_copy_utf32(&string, U"�C�mo est�s?") == 0);

Edit 2

Now I am even more confused, because I tried to recreate the error in a smaller example.

#include <uchar.h>
#include <stdlib.h>
#include <stdio.h>

static size_t test_utf8(const char * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

static size_t test_utf32(const char32_t * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

int main(void){
    size_t len;

    len = test_utf8(u8"¿Cómo estás?");
    printf("utf-8 length: %zu\n", len);

    len = test_utf32(U"¿Cómo estás?");
    printf("utf-32 length: %zu\n", len);

    return 0;
}

This prints:

utf-8 length: 15
utf-32 length: 12

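This confirms how I originally thought it worked.

So I suppose this means there is a problem somewhere in the library code I am using. But I still don't know what is going on.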

score 2 · Accepted Answer

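I figured out the problem.

I took a hex dump of the two string literals (the one that breaks in the original code and the one that works).

Here is the broken string literal (which I wrote on Windows):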

00000000: 5522 bf43 f36d 6f20 6573 74e1 733f 220a  U".C.mo est.s?".

Here is the working string literal (which I wrote on an Ubuntu machine):

00000000: 5522 c2bf 43c3 b36d 6f20 6573 74c3 a173  U"..C..mo est..s
00000010: 3f22 0a                                  ?".

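Although they look exactly the same in a code editor, and although both have the U prefix, they are encoded differently in the source file.

I'm not quite sure which encoding is which, but my takeaway is that it is very, very important to check the encoding of the source file that contains the literal.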

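Edit 1

As @melpomene pointed out in the comments:

The broken encoding is Windows-1252, where ¿, ó, and á are the single bytes bf, f3, and e1.

The working encoding is UTF-8, where the same characters are the two-byte sequences c2 bf, c3 b3, and c3 a1.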
