c++ - How do I convert a boost::spirit::lex token's value from an iterator_range to a string?

Question

When I try to convert a token's value from an iterator_range, the lexer fails when it tries to read the next token.

Here is the Tokens struct that contains the token definitions. (I don't think it is relevant, but I'm including it just in case.)

template <typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
    Tokens();

    boost::spirit::lex::token_def<std::string> identifier;
    boost::spirit::lex::token_def<std::string> string;
    boost::spirit::lex::token_def<bool> boolean;
    boost::spirit::lex::token_def<double> real;
    boost::spirit::lex::token_def<> comment;
    boost::spirit::lex::token_def<> whitespace;
};

template <typename Lexer>
Tokens<Lexer>::Tokens()
{
    // Define regex macros
    this->self.add_pattern
        ("LETTER", "[a-zA-Z_]")
        ("DIGIT", "[0-9]")
        ("INTEGER", "-?{DIGIT}+")
        ("FLOAT", "-?{DIGIT}*\\.{DIGIT}+");

    // Define the tokens' regular expressions
    identifier = "{LETTER}({LETTER}|{DIGIT})*";
    string = "\"[a-zA-Z_0-9]*\"";
    boolean = "true|false";
    real = "{INTEGER}|{FLOAT}";
    comment = "#[^\n\r\f\v]*$";
    whitespace = "\x20\n\r\f\v\t+";

    // Define tokens
    this->self
        = identifier
        | string
        | boolean
        | real
        | '{'
        | '}'
        | '<'
        | '>';

    // Define tokens to be ignored
    this->self("WS")
        = whitespace
        | comment;
}

Here are the definitions of my token and lexer types:

typedef lex::lexertl::token<char const*> TokenType;
typedef lex::lexertl::actor_lexer<TokenType> LexerType;

Here is the code I use to read a token and convert its value to a string:

Tokens<LexerType> tokens;

std::string string = "9index";
char const* first = string.c_str();
char const* last = &first[string.size()];
LexerType::iterator_type token = tokens.begin(first, last);
LexerType::iterator_type end = tokens.end();

//typedef boost::iterator_range<char const*> iterator_range;
//const iterator_range& range = boost::get<iterator_range>(token->value());
//std::cout << std::string(range.begin(), range.end()) << std::endl;

++token;

token_is_valid(*token); // Returns false ONLY if I uncomment the above code

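The output of this code is "9" (it reads the first number, leaving "index" in the stream). If I print the value of string(first, last) at this point, it shows "ndex". For some reason, the lexer fails on that "i" character?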

I even tried using a std::stringstream for the conversion, but that also leaves the next token invalid:

std::stringstream out;
out << token->value();
std::cout << out.str() << std::endl;

++token;

token_is_valid(*token); // still fails

Finally, if I simply send the token's value to cout, the next token is valid:

std::cout << token->value() << std::endl;

++token;

token_is_valid(*token); // success, what?

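What am I missing about how the iterator_range returned by token->value() works? None of the methods I used to convert it to a string modify the iterator_range or the lexer's input character stream.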

Edit: I'm adding this here because a comment reply is too short to fully explain what happened.

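I figured it out. As sehe and drhirsch pointed out, the code in my original question was a sanitized version of what I was actually doing. I'm testing the lexer with gtest unit tests that use a test fixture class. As a member of that class, I have void scan(const std::string& str), which assigns the first and last iterators (data members of the fixture) from the given string. The problem is that as soon as we exit this function, the const std::string& str parameter is popped off the stack and no longer exists, which invalidates those iterators even though they are data members of the fixture.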

Moral of the story: the object that the iterators passed to lexer::begin() refer to must stay alive for as long as you expect to read tokens.
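One way to follow that rule in a fixture is to have the fixture keep its own copy of the input and take the iterators from that copy. The following is only a minimal sketch: the names LexerFixture, input and remaining are made up for illustration, and the real fixture would derive from the gtest fixture base class and pass first and last on to the lexer's begin().

```cpp
#include <string>

// Hypothetical stand-in for the test fixture described above.
// All names here are illustrative assumptions, not the original code.
struct LexerFixture
{
    std::string input;            // owned copy; lives as long as the fixture
    char const* first = nullptr;  // start of the owned buffer
    char const* last = nullptr;   // one past the end of the owned buffer

    void scan(const std::string& str)
    {
        input = str;                  // copy first, so the caller's string may go away
        first = input.c_str();        // pointers now refer to the member, not to str
        last = first + input.size();
    }

    // Copies the characters in [first, last) out of the owned buffer.
    std::string remaining() const
    {
        return std::string(first, last);
    }
};
```

With this layout, a call such as scan(std::string("9index")) leaves first and last valid after the temporary argument is gone. Copying the fixture object itself would still leave the copied pointers referring to the original's buffer, so the sketch assumes the fixture is not copied.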

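I would rather delete this question than document my silly mistake on the internet, but for the sake of helping the community I suppose I should leave it up.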

score 5 · Accepted Answer

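From the code given, it looks like you are looking at a compiler/library bug. I cannot reproduce the problem with any of the following combinations: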

Edit: now includes clang++ and boost 1_49_0. Valgrind comes up clean for a select number of test cases.

clang++ 2.9, -O3, boost 1_46_1
clang++ 2.9, -O0, boost 1_46_1
clang++ 2.9, -O3, boost 1_48_0
clang++ 2.9, -O0, boost 1_48_0
clang++ 2.9, -O3, boost 1_49_0
clang++ 2.9, -O0, boost 1_49_0
gcc 4.4.5, -O0, boost 1_42_1
gcc 4.4.5, -O0, boost 1_46_1
gcc 4.4.5, -O0, boost 1_48_0
gcc 4.4.5, -O0, boost 1_49_0
gcc 4.4.5, -O3, boost 1_42_1
gcc 4.4.5, -O3, boost 1_46_1
gcc 4.4.5, -O3, boost 1_48_0
gcc 4.4.5, -O3, boost 1_49_0
gcc 4.6.1, -O0, boost 1_46_1
gcc 4.6.1, -O0, boost 1_48_0
gcc 4.6.1, -O0, boost 1_49_0
gcc 4.6.1, -O3, boost 1_42_1
gcc 4.6.1, -O3, boost 1_46_1
gcc 4.6.1, -O3, boost 1_48_0
gcc 4.6.1, -O3, boost 1_49_0

Full code tested:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>

#include <iostream>
#include <string>

namespace qi    = boost::spirit::qi;
namespace lex   = boost::spirit::lex;

template <typename Lexer>
struct Tokens : lex::lexer<Lexer>
{
    Tokens();

    lex::token_def<std::string> identifier;
    lex::token_def<std::string> string;
    lex::token_def<bool> boolean;
    lex::token_def<double> real;
    lex::token_def<> comment;
    lex::token_def<> whitespace;
};

template <typename Lexer>
Tokens<Lexer>::Tokens()
{
    // Define regex macros
    this->self.add_pattern
        ("LETTER", "[a-zA-Z_]")
        ("DIGIT", "[0-9]")
        ("INTEGER", "-?{DIGIT}+")
        ("FLOAT", "-?{DIGIT}*\\.{DIGIT}+");

    // Define the tokens' regular expressions
    identifier = "{LETTER}({LETTER}|{DIGIT})*";
    string = "\"[a-zA-Z_0-9]*\"";
    boolean = "true|false";
    real = "{INTEGER}|{FLOAT}";
    comment = "#[^\n\r\f\v]*$";
    whitespace = "\x20\n\r\f\v\t+";

    // Define tokens
    this->self
        = identifier
        | string
        | boolean
        | real
        | '{'
        | '}'
        | '<'
        | '>';

    // Define tokens to be ignored
    this->self("WS")
        = whitespace
        | comment;
}

////////////////////////////////////////////////
typedef lex::lexertl::token<char const*> TokenType;
typedef lex::lexertl::actor_lexer<TokenType> LexerType;

int main(int argc, const char *argv[])
{
    Tokens<LexerType> tokens;

    std::string string = "9index";
    char const* first = string.c_str();
    char const* last = &first[string.size()];
    LexerType::iterator_type token = tokens.begin(first, last);
    LexerType::iterator_type end = tokens.end();

    typedef boost::iterator_range<char const*> iterator_range;
    const iterator_range& range = boost::get<iterator_range>(token->value());
    std::cout << std::string(range.begin(), range.end()) << std::endl;

    ++token;

    // Returns false ONLY if I uncomment the above code
    std::cout << "Next valid: " << std::boolalpha << token_is_valid(*token) << '\n'; 

    return 0;
}
