c++ - Why does `std::mbrlen` on mingw-w64 always return one (`1`)?

Question

When I compile the following source code with mingw-w64, std::mbrlen always reports 1 (one) byte:

#include <cstddef>
#include <cstdio>
#include <clocale>
#include <cstring>
#include <cwchar>

void print_mb(const char* ptr)
{
  std::size_t index{0};
  const char* end = ptr + std::strlen(ptr);
  int len;
  while((len = std::mbrlen(ptr, end-ptr, nullptr)) > 0)
  {
    std::printf("Character #%zu is %i bytes long.\n", index++, len);
    ptr += len;
  }
}

int main()
{
  std::setlocale(LC_ALL, "en_US.utf8");
  const char* str = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
  print_mb(str);
}

The sample code is based on the code from the std::mbrtowc page.

After compiling this sample under mingw-w64 with

gcc sample.cxx

I get the following output from the program:

Character #0 is 1 bytes long.
Character #1 is 1 bytes long.
Character #2 is 1 bytes long.
Character #3 is 1 bytes long.
Character #4 is 1 bytes long.
Character #5 is 1 bytes long.
Character #6 is 1 bytes long.
Character #7 is 1 bytes long.
Character #8 is 1 bytes long.
Character #9 is 1 bytes long.

However, if I compile the same code with the "online" compiler on the cppreference page, or with GCC under Arch Linux (again with a simple gcc sample.cxx), or with Microsoft Visual C++ 2017 (cl sample.cxx), or with the Intel C++ Compiler 2018 (icl sample.cxx), I get this:

Character #0 is 1 bytes long.
Character #1 is 2 bytes long.
Character #2 is 3 bytes long.
Character #3 is 4 bytes long.

What could cause this behavior of std::mbrlen under mingw-w64? Thanks.

My Microsoft Windows host is Microsoft Windows 10 x86-64. The mingw-w64, Microsoft Visual C++, and Intel C++ builds were all done on this host.

score 0 · Accepted Answer

Windows does not support UTF-8 through the C and C++ locales.

https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

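The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.

Also, locale names on Windows differ from those on Linux, for example setlocale( LC_ALL, "English_United States.1252" );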
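Since the question's code never checks what std::setlocale returns, a failed locale request goes unnoticed. Here is a minimal sketch of checking it. The helper name set_first_available_locale is made up for illustration; std::setlocale itself is standard and returns a null pointer when it cannot honor the request, leaving the current locale unchanged.

```cpp
#include <cassert>
#include <clocale>

// Hypothetical helper (not part of any library): tries each candidate locale
// name in order and returns the first one the C runtime accepts, or nullptr
// if none of them is accepted.
const char* set_first_available_locale(const char* const* names, int count)
{
  assert(names != nullptr);
  for (int i = 0; i < count; ++i)
  {
    // A null return means the request failed and the locale was not changed.
    if (std::setlocale(LC_ALL, names[i]) != nullptr)
      return names[i];
  }
  return nullptr;
}
```

If the "en_US.utf8" request is rejected, the program stays in the default "C" locale. That would be consistent with every byte being reported as a 1-byte character, although I have not verified this on mingw-w64 itself.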

The C and C++ locale system is implementation-defined, and in practice the only really usable implementation is the one on Linux (glibc).

On Windows, if you want UTF-8 or other Unicode functionality, you need to resort to the Windows API or other libraries.
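If all you need is the byte length of each UTF-8 character, one locale-independent option is to decode the lead byte yourself instead of going through the C locale. This is a rough sketch of that swapped-in technique, not the Windows API route mentioned above. The function name utf8_sequence_length is hypothetical, and the check is deliberately shallow: it only looks at the lead byte and the continuation-byte pattern, and does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the length in bytes (1 to 4) of the UTF-8
// sequence starting at s, looking at no more than n bytes. Returns 0 if n is 0,
// the lead byte is invalid, the sequence is truncated, or a continuation byte
// does not have the form 10xxxxxx.
std::size_t utf8_sequence_length(const char* s, std::size_t n)
{
  assert(s != nullptr);
  if (n == 0)
    return 0;

  const unsigned char lead = static_cast<unsigned char>(s[0]);
  std::size_t len = 0;
  if (lead < 0x80)                len = 1;  // 0xxxxxxx: ASCII
  else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
  else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
  else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
  else                            return 0; // stray continuation or invalid byte

  if (len > n)
    return 0; // sequence is cut off by the end of the buffer

  for (std::size_t i = 1; i < len; ++i)
  {
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
      return 0; // expected a continuation byte
  }
  return len;
}
```

Calling this in place of std::mbrlen in print_mb (with the same ptr and end-ptr arguments) gives lengths 1, 2, 3 and 4 for the string in the question, regardless of the current locale. Note that, unlike std::mbrlen, it reports 1 rather than 0 for a NUL byte, so the loop has to stop at end itself.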
