c++ - C++ Unicode question

Question

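I know about ICU and about small libraries such as the utf8 one on Code Project (I forget the exact name), but none of these is quite what I want.

What I really want is something like ICU, but wrapped up in a friendlier way.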

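Specifically:

Fully object-oriented
Implements the C++ standard streams, or at least something that fills the same role.
Can format times, dates, and so on in a locale-dependent way (e.g. dd/mm/yy in the UK and mm/dd/yy in the US).
Lets me choose the "internal" encoding of strings, so that I can, for example, have it use UTF-16 on Windows to avoid a lot of conversions when passing strings to the Windows API and DirectX
Easy conversion of strings between encodings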

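If no such library exists, is it possible to wrap ICU using the standard C++ classes, so that I could create a ustring with the same usage as std::string and std::wstring, and also implement stream versions (preferably fully compatible with the existing ones, i.e. I could pass one to a function expecting a std::ostream and it would convert between its internal format and ASCII (or UTF-8) on the fly)? Assuming it is possible, how much work would it be?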

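EDIT: I also looked at the C++0x standard and noticed literals for UTF-8, UTF-16, and UTF-32. Does that mean the standard library (strings, streams, and so on) will fully support these encodings and conversion between them? If so, does anyone know how long it will take Visual Studio to support these features?

EDIT2: As for using the existing C++ support, I will look into locales and facets.

One problem I have run into is that streams defined around wchar_t, which is 2 bytes on Windows, still seem to write ASCII to the file itself when used for file I/O.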

std::wofstream file(L"myfile.txt", std::ios::out);
file << L"Hello World!" << std::endl;

produces the following hex in the file:
48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
which is clearly ASCII rather than the expected UTF-16 output:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00

score 3 · Accepted Answer

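What I really want is something like ICU, but wrapped up in a friendlier way

Unfortunately, there is no such thing. ICU's API is not that bad, so with some effort you can get used to it.

Can format times, dates, and so on in a locale-dependent way (e.g. dd/mm/yy in the UK and mm/dd/yy in the US).

This is fully supported by the std::locale class; read up on how to use it. You can also imbue a std::iostream with a locale so that numbers and dates are formatted properly.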
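
As a rough illustration of the point above, here is a minimal sketch of locale-dependent date formatting with the standard library alone. The helper names `format_date` and `locale_or_classic` are made up for this example, and it assumes a C++11 compiler (for `std::put_time`). Which named locales exist depends on the system, so the second helper falls back to the classic "C" locale when a name is not installed.

```cpp
#include <cassert>
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: formats the date part of `when` using the
// conventions of `loc`. "%x" asks the locale's time_put facet for its
// preferred date representation.
std::string format_date(const std::tm& when, const std::locale& loc)
{
    std::ostringstream out;
    out.imbue(loc);                      // the stream now formats via `loc`
    out << std::put_time(&when, "%x");
    return out.str();
}

// Hypothetical helper: returns the named locale if the system provides
// it, otherwise the classic "C" locale. Locale names are platform-specific.
std::locale locale_or_classic(const char* name)
{
    assert(name != nullptr);
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Calling `format_date(when, locale_or_classic("en_GB.UTF-8"))` on a system that has that locale installed would be expected to give a day-first date, while the classic locale gives month-first; the exact strings depend on the platform's locale data.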

Easy conversion of strings between encodings

std::locale provides facets for converting an 8-bit local encoding to a wide one and back.
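
The conversion facet is also what decides which bytes a wide file stream writes, which is why the wofstream in the question produced narrow output: the default locale's facet converts wchar_t down to a narrow encoding. The sketch below swaps in `std::codecvt_utf16` instead. It assumes C++11; note that `<codecvt>` was deprecated in C++17, so treat this as one possible approach rather than a recommendation. The function name `write_utf16le` is hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-16 little-endian,
// preceded by a byte order mark. Returns false if the file could not be
// opened or written.
bool write_utf16le(const std::string& path, const std::wstring& text)
{
    typedef std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian> Utf16Le;

    std::wofstream file;
    // The locale takes ownership of the facet. Imbue before opening so
    // that every character goes through the UTF-16 conversion.
    file.imbue(std::locale(file.getloc(), new Utf16Le));
    // Binary mode stops Windows text-mode translation from inserting
    // 0x0D bytes into the already encoded output.
    file.open(path.c_str(), std::ios::out | std::ios::binary);
    if (!file)
        return false;

    file.put(wchar_t(0xFEFF));  // BOM; the facet encodes it as FF FE
    file << text;
    file.close();
    return !file.fail();
}
```

With this in place, `write_utf16le("myfile.txt", L"Hello World!\r\n")` should give a file that starts with FF FE followed by two bytes per character, which is the layout the question expected.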

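so that I can, for example, have it use UTF-16

ICU uses UTF-16 internally. On Win32, wchar_t and wstring also use UTF-16. On other operating systems, most implementations define wchar_t as UTF-32, so wstring uses UTF-32.

Note: std::locale support is not perfect, but it already provides many tools that are useful for character manipulation.

See: http://www.cplusplus.com/reference/std/locale/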

score 2 · Accepted Answer

This is how I use ICU to convert between std::string (in UTF-8) and std::wstring

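#include <string>
#include <vector>
#include <unicode/ustring.h> // u_strFromWCS, u_strToUTF8, u_strFromUTF8, u_strToWCS

/** Converts a std::wstring into a std::string with UTF-8 encoding.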
 */
template < typename StringT >
StringT utf8 ( std::wstring const & rc_string );

/** Converts a std::string with UTF-8 encoding into a std::wstring.
 */
template < typename StringT >
StringT utf8 ( std::string const & rc_string );

/** Nop specialization for std::string.
 */
template < >
inline std::string utf8 ( std::string const & rc_string )
{
  return rc_string;
}

/** Nop specialization for std::wstring.
 */
template < >
inline std::wstring utf8 ( std::wstring const & rc_string )
{
  return rc_string;
}

template < >
std::string utf8 ( std::wstring const & rc_string )
{
  std::string result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size() * 4); // UTF-8 needs at most 4 bytes per code point
  buffer.resize(rc_string.size() * 2); // UTF-16 needs at most 2 code units per code point

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromWCS(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromWCS failed");
  }
  buffer.resize(len);

  u_strToUTF8(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToUTF8 failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */


template < >
std::wstring utf8 ( std::string const & rc_string )
{
  std::wstring result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size());
  buffer.resize(rc_string.size());

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromUTF8(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromUTF8 failed");
  }
  buffer.resize(len);

  u_strToWCS(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToWCS failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */

Using it is as simple as this:

std::string s = utf8<std::string>(std::wstring(L"some string"));
std::wstring w = utf8<std::wstring>(std::string("some string"));

score 2 · Accepted Answer

I always work like this:

byte stream in some encoding -> ICU -> wistream -> STL & Boost -> wostream -> ICU -> byte stream in some encoding

score 1 · Accepted Answer

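You can format dates, times, and so on by specifying a particular locale. As for rolling your own: that is always possible, and you can pull as much or as little from the underlying library as you need.

I also looked at the C++0x standard and noticed literals for UTF-8, UTF-16, and UTF-32. Does that mean the standard library (strings, streams, and so on) will fully support these encodings and conversion between them?

Yes. Note, however, that these are distinct data types, not your regular wchar_t sequences or wstring.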
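
To make the "distinct data types" remark concrete, the sketch below uses the C++11 types `char16_t`, `char32_t`, `std::u16string`, and `std::u32string`. The `count_code_points` helper is a name invented for this example and assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// char16_t and char32_t are types of their own, not aliases of wchar_t,
// so std::u16string and std::wstring never convert to each other implicitly.
static_assert(!std::is_same<char16_t, wchar_t>::value, "char16_t is distinct");
static_assert(!std::is_same<char32_t, wchar_t>::value, "char32_t is distinct");

// Hypothetical helper: counts code points in well-formed UTF-16 text.
// size() reports code units, and a character outside the Basic
// Multilingual Plane occupies two of them (a surrogate pair).
std::size_t count_code_points(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text) {
        // Low (trailing) surrogates 0xDC00..0xDFFF continue a pair,
        // so they do not start a new code point.
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

A string such as `u"a\U0001F600"` holds three UTF-16 code units but two code points, whereas the same text as `U"a\U0001F600"` in a `std::u32string` has one element per code point.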

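If so, does anyone know how long it will take Visual Studio to support these features?

As far as I know: VC9 (VS2008) only partially supports some TR1 features. VC10 (VS2010) is expected to have better support.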

score -1 · Accepted Answer

I made my own little wrapper. I can share it if you like.

Answered 2009-05-07T16:41:08.063

score -1 · Accepted Answer

Tough luck. I know the Dinkumware library provides some Unicode support; you can check the documentation on their website. AFAIK, it is not free.
