c++ - How to parse UTF-8 with boost::spirit?

Question

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>

void parse_simple_string()
{
    namespace qi = boost::spirit::qi;    
    namespace encoding  = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;

    typedef std::wstring::const_iterator iterator_type;

    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t>  (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}

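I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but I still cannot parse the word "你好".

The expected result is

12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP 你好

but the actual result is 12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP

The Chinese word "你好" is not parsed.

The OS is Windows 7 64-bit, and my editor saves the source file as UTF-8.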

score 9 · Accepted Answer

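If your input is UTF-8, you can try the Unicode iterators from Boost.Regex.

For example, use boost::u8_to_u32_iterator:

A bidirectional iterator adapter that makes an underlying sequence of UTF-8 characters look like a (read-only) sequence of UTF-32 characters.

Live demo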

#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

    auto &&utf8_text=u8"你好，世界！";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));

    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}

The output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
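The trailing &#0; in that output comes from the string literal itself: utf8_text is a char array that includes the terminating NUL, and begin/end span the whole array. The sketch below illustrates the difference using only the standard library; array_length and text_length are hypothetical helper names, not part of Boost or of the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iterator>

// Number of elements covered by std::begin/std::end on a char array.
// For a string literal this includes the terminating NUL.
template <std::size_t N>
std::size_t array_length(const char (&text)[N]) {
    return static_cast<std::size_t>(std::end(text) - std::begin(text));
}

// Number of elements before the first NUL.
// Only meaningful for NUL-terminated arrays such as string literals.
template <std::size_t N>
std::size_t text_length(const char (&text)[N]) {
    assert(text[N - 1] == '\0');  // std::strlen needs a terminator
    return std::strlen(text);
}
```

For the six UTF-8 bytes of "你好", array_length reports 7 and text_length reports 6. Under that assumption, building the end iterator from begin(utf8_text) + text_length(utf8_text) instead of end(utf8_text) should drop the final &#0; from the result.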

score 1 · Accepted Answer

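Although Evgeny Panasyuk's answer is correct, using u8_to_u32_iterator can be unsafe when the input string is not NUL-terminated, because of a buffer overflow bug. Consider the following example:

File foobar.cpp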

#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>

int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};

    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};

    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}

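When this is compiled with the command clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp and then run, Clang's AddressSanitizer reports a stack-buffer-overflow error. The error occurs because the last character in the buffer is the lead byte of a 4-byte UTF-8 sequence, so the iterator keeps reading bytes after it and runs past the end of the buffer.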
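To make the arithmetic concrete, here is a small standard-library sketch of how far a decoder would reach if it trusted the lead byte without checking the buffer bounds. The names announced_length and bytes_past_end are hypothetical, and this is not Boost's implementation, only an illustration of the lead-byte rule.

```cpp
#include <cassert>
#include <cstddef>

// Sequence length announced by a UTF-8 lead byte.
// Returns 0 for a byte that cannot start a sequence.
inline std::size_t announced_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid lead
}

// Number of bytes beyond the buffer that a decoder would touch
// if it trusted the lead byte at index `pos`.
inline std::size_t bytes_past_end(const char* data, std::size_t size,
                                  std::size_t pos) {
    assert(pos < size);
    const std::size_t len =
        announced_length(static_cast<unsigned char>(data[pos]));
    const std::size_t available = size - pos;
    return len > available ? len - available : 0;
}
```

For the contents array in foobar.cpp, the byte 0xF1 at index 5 announces a 4-byte sequence while only 1 byte is left, so bytes_past_end(contents, sizeof(contents), 5) yields 3.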

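If the last byte were NUL (const char contents[] = "Hello\xF1";), the iterator would detect an encoding error when it reads the NUL character and abort the next read, so we would get an uncaught exception instead of undefined behavior.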

In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator; otherwise you may run into UB.
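When NUL termination cannot be guaranteed, one alternative is to validate or decode the buffer with explicit bounds checks before any unchecked iterator sees it. The following is a minimal sketch under that assumption, written against the standard library only (C++17); decode_utf8_checked is a hypothetical function, not a Boost API, and it deliberately skips the rejection of overlong encodings and surrogate code points to stay short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Decodes UTF-8 into code points without ever reading outside `bytes`.
// Returns std::nullopt for an invalid lead byte, a truncated sequence,
// or a malformed continuation byte.
inline std::optional<std::vector<std::uint32_t>>
decode_utf8_checked(std::string_view bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        if (lead < 0x80) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;

        // Bounds check: refuse a sequence that would run past the buffer.
        if (len == 0 || len > bytes.size() - i) return std::nullopt;

        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return std::nullopt;  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3Fu);
        }
        out.push_back(cp);
        i += len;
    }
    assert(i == bytes.size());  // every byte was consumed exactly once
    return out;
}
```

Called as decode_utf8_checked(std::string_view(contents, sizeof(contents))) on the buffer from foobar.cpp, this returns std::nullopt rather than reading past the array, while the UTF-8 bytes of "你好" decode to the code points 20320 and 22909.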
