c++ - Stumped with Unicode, Boost, C++, codecvts

Question

In C++, I want to work with Unicode. After falling down the Unicode rabbit hole, I have ended up in a train wreck of confusion, headaches, and locales.

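With Boost in particular, I have run into trouble using Unicode file paths and using the Boost Program Options library with Unicode input. I have read everything I could find on locales, codecvts, Unicode encodings, and Boost.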

My current attempt is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows). I have been trying to avoid wchar_t.

The closest I have actually gotten is with Boost.Locale, trying to convert from a UTF-8 string to a UTF-32 string on output.

#include <iostream>
#include <string>
#include <boost/locale.hpp>
#include <locale>

int main(void)
{
  std::string data("Testing, 㤹");

  std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
  std::locale toLoc   = boost::locale::generator().generate("en_US.UTF-32");

  typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
  cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);

  std::locale convLoc = std::locale(fromLoc, toCvt);

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

  // Output is unconverted -- what?

  return 0;
}

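I think I had some other kind of conversion working with wide characters, but I really don't know what I'm doing. At this point I don't know what the right tool for the job is. Help?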

score 11 · Accepted Answer

OK, after a long few months I have figured it out, and I would like to help people who hit this in the future.

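First of all, the codecvt approach was the wrong way to do it. Boost.Locale provides a simple way to convert between character sets in its boost::locale::conv namespace. Here is one example (there are others that are not based on locales).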

#include <iostream>
#include <string>
#include <boost/locale.hpp>
namespace loc = boost::locale;

int main(void)
{
  loc::generator gen;
  std::locale blah = gen.generate("en_US.utf-32");

  std::string UTF8String = "Tésting!";
  // from_utf will also work with wide strings as it uses the character size
  // to detect the encoding.
  std::string converted = loc::conv::from_utf(UTF8String, blah);

  // Outputs a UTF-32 string.
  std::cout << converted << std::endl;

  return 0;
}

If you replace "en_US.utf-32" with "", the output is in the user's locale instead.

I still don't know how to make std::cout do this all the time, but Boost.Locale's translate() function outputs in the user's locale.

As for using UTF-8 strings with the filesystem across platforms, that seems to be possible; here is a link on how to do it.

score 3 · Accepted Answer

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

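This does no conversion, ~~because it uses codecvt<char, char, mbstate_t>, which is a no-op~~. The only standard streams that use codecvt are file streams. std::cout is not required to perform any conversion at all.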
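To illustrate the point about file streams, here is a small sketch using only the standard library, in which a wide file stream does apply the imbued codecvt. The helper names and the choice of std::codecvt_utf8 are assumptions made for this example (that facet is deprecated since C++17 but still widely shipped, and with a 16-bit wchar_t it only covers the Basic Multilingual Plane).

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: write wide text to a file as UTF-8 bytes.
inline void write_utf8_file(const std::string& path, const std::wstring& text)
{
    std::wofstream out;
    // Imbue before opening so the file buffer uses this facet.
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path.c_str(), std::ios::binary);
    out << text;
}   // the stream is flushed and closed when it goes out of scope

// Hypothetical helper: read the file back as raw bytes, with no conversion.
inline std::string read_raw_bytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Imbuing the same kind of locale on std::cout, as in the question, gives no such guarantee, because a narrow stream has no wide-to-narrow step for the facet to take part in.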

To force Boost.Filesystem to interpret narrow strings as UTF-8 on Windows, call boost::filesystem::path::imbue with a locale that carries a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.

score 3 · Accepted Answer

The Boost.Filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.

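However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ on Windows, at least as of Boost version 1.47. A code comment explains why: essentially, the Visual C++ standard library provides non-standard wchar_t-based constructors that the Boost.Filesystem classes rely on, and g++ does not support these extensions.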

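One workaround is to use 8.3 short filenames, but that solution is a bit brittle, because on older Windows versions the user can turn off automatic generation of short filenames.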

Example code for using Boost.Filesystem on Windows:

#include "CmdLineArgs.h"        // CmdLineArgs
#include "throwx.h"             // throwX, hopefully
#include "string_conversions.h" // ansiWithFillersFrom( wstring )

#include <boost/filesystem/fstream.hpp>     // boost::filesystem::ifstream
#include <iostream>             // std::cout, std::cerr, std::endl
#include <stdexcept>            // std::runtime_error, std::exception
#include <string>               // std::string
#include <stdlib.h>             // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;

inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }

int main()
{
    try
    {
        CmdLineArgs const   args;
        wstring const       programPath     = args.at( 0 );

        hopefully( args.nArgs() == 2 )
            || throwX( "Usage: " + ansi( programPath ) + " FILENAME" );

        wstring const       filePath        = args.at( 1 );
        bfs::ifstream       stream( filePath );     // Nice Boost ifstream subclass.
        hopefully( !stream.fail() )
            || throwX( "Failed to open file '" + ansi( filePath ) + "'" );

        string line;
        while( getline( stream, line ) )
        {
            cout << line << endl;
        }
        hopefully( stream.eof() )
            || throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
