
In my application, I want a "pre-parsing" phase that adjusts the token stream before a Qi parser sees it.

One way to do this would be a "lexer adaptor": a type that is constructed from a lexer, is itself a lexer, and wraps and modifies the behavior of the inner lexer. However, it would be simpler and easier to debug if I instead lexed the entire input with the inner lexer first, stored the results in a std::vector<token_type>, modified them as desired, and then passed the result to the parser. (In my application I don't think there would be any performance concern with this.)

In an email exchange from a few years back, someone described exactly this question and Hartmut said that it should be trivial. http://comments.gmane.org/gmane.comp.parsers.spirit.general/24899

However, I didn't find any code examples or instructions for doing this, beyond "look at the headers in spirit::lex and figure it out". That will likely occupy me for quite a while unless you, dear reader, can assist.

The specific question is: how can I make a "shim" lexer that wraps a pair of std::vector<token_type>::iterators and looks to spirit::qi just like a standard spirit::lex lexer?

Edit: To be clear, this is not a duplicate of the question "Using Boost.Spirit.Qi with custom lexer". My token_types are attributed, and the details of the extra things that Hartmut says I need to do are the substance of this question.


Edit: Okay, I made an SSCCE. This version does not have attributed lexer tokens, but even without them I can't get it to work yet, so it seems as good an SSCCE as any to get started with.

Highlights:

"Token buffer" type:

template<typename TokenType>
struct token_buffer {
    std::vector<TokenType> tokens_;

    token_buffer() = default;

    bool operator()(const TokenType & t) {
        tokens_.push_back(t);
        return true;
    }

    void print(std::ostream & o) const { ... }
};

My first attempt at making a "buffer lexer", which looks like a lex::lexer to Qi but in fact serves tokens from a buffer. This one derives from lex_basic (defined in the complete source below); I'm not sure if that's correct.

template<typename LexerType>
class buffer_lexer : public lex_basic<LexerType> {
public:
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator iterator_type;

private:
    const buff_type & buff_;

public:
    buffer_lexer(const buff_type & b) : lex_basic<LexerType>(), buff_(b) {}

    iterator_type begin() const { return buff_.begin(); }
    iterator_type end() const { return buff_.end(); }

    // for consistency with regular lexer `begin` signature, not sure if this is needed
    template<typename T>
    iterator_type begin(T, T) { return begin(); }
};

My second attempt at making a buffer lexer. This one does not derive from lex_basic and instead tries to follow these instructions found in the header boost/spirit/home/lex/lexer/lexertl/lexer.hpp:

///////////////////////////////////////////////////////////////////////////
//
//  Every lexer type to be used as a lexer for Spirit has to conform to
//  the following public interface:
//
//    typedefs:
//        iterator_type   The type of the iterator exposed by this lexer.
//        token_type      The type of the tokens returned from the exposed
//                        iterators.
//
//    functions:
//        default constructor
//                        Since lexers are instantiated as base classes
//                        only it might be a good idea to make this
//                        constructor protected.
//        begin, end      Return a pair of iterators, when dereferenced
//                        returning the sequence of tokens recognized in
//                        the input stream given as the parameters to the
//                        begin() function.
//        add_token       Should add the definition of a token to be
//                        recognized by this lexer.
//        clear           Should delete all current token definitions
//                        associated with the given state of this lexer
//                        object.
//
//    template parameters:
//        Iterator        The type of the iterator used to access the
//                        underlying character stream.
//        Token           The type of the tokens to be returned from the
//                        exposed token iterator.
//        Functor         The type of the InputPolicy to use to instantiate
//                        the multi_pass iterator type to be used as the
//                        token iterator (returned from begin()/end()).
//
///////////////////////////////////////////////////////////////////////////

Here's the "buffer_lexer_raw" that I came up with:

template<typename Iterator,
     typename TokenType,
     typename Functor = lex::lexertl::functor<TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
    typedef TokenType token_type;
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator iterator_type;

    typedef typename boost::detail::iterator_traits<typename token_type::iterator_type>::value_type char_type;

private:
    buff_type buff_;

public:
    buffer_lexer_raw() {}

    void set_buffer(const buff_type & b) { buff_ = b; }

    iterator_type begin() const { return buff_.begin(); }
    iterator_type end() const { return buff_.end(); }

    // for consistency with regular lexer `begin` signature, not sure if this is needed
    template<typename T>
    iterator_type begin(T, T) { return begin(); }

    std::size_t add_token(char_type const* state, char_type tokendef,
            std::size_t token_id, char_type const* targetstate)
    {
        return 1;
    }

    void clear(char_type const* state) {}
};

The test code responds to a macro defined at the top of the file.

// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of our lex:: api calls
#define WHICH_LEXER_TYPE 0

The test code will:

  • Run the lexer on a simple test case and make a detailed dump of the lexed token sequence.
  • Run the lexer and grammar in tandem on a few simple test cases using lex::tokenize_and_parse, and dump the resulting AST.
  • Try lexing and parsing again, using the lexer selected by the macro to generate iterators for use with qi::parse. It will check that the resulting AST is the same as the AST generated the "easy" way.

Currently the #define WHICH_LEXER_TYPE 0 option compiles and works great for me with both gcc-4.8 and clang-3.6.

I can't actually get it to compile with the #define WHICH_LEXER_TYPE 1 or #define WHICH_LEXER_TYPE 2 options. With type 1, clang gives the following error message which I don't have the foggiest idea about:

In file included from main.cpp:1:
In file included from /usr/include/boost/spirit/include/lex_lexertl.hpp:16:
In file included from /usr/include/boost/spirit/home/lex/lexer_lexertl.hpp:15:
In file included from /usr/include/boost/spirit/home/lex.hpp:13:
In file included from /usr/include/boost/spirit/home/lex/lexer.hpp:14:
In file included from /usr/include/boost/spirit/home/lex/lexer/token_def.hpp:21:
In file included from /usr/include/boost/spirit/home/lex/reference.hpp:16:
/usr/include/boost/spirit/home/qi/reference.hpp:43:30: error: no matching member function for call to 'parse'
            return ref.get().parse(first, last, context, skipper, attr);
                   ~~~~~~~~~~^~~~~
/usr/include/boost/spirit/home/qi/parse.hpp:86:42: note: in instantiation of function template specialization 'boost::spirit::qi::reference<const
      boost::spirit::qi::rule<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
      char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > >, ast::Body (),
      boost::spirit::locals<std::basic_string<char>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      boost::spirit::unused_type, boost::spirit::unused_type> >::parse<__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >, boost::spirit::context<boost::fusion::cons<ast::Body &, boost::fusion::nil>,
      boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, boost::spirit::unused_type,
      ast::Body>' requested here
        return compile<qi::domain>(expr).parse(first, last, context, unused, attr);
                                         ^
main.cpp:414:12: note: in instantiation of function template specialization 'boost::spirit::qi::parse<__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >,
      basic_grammar<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
      char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > >, ast::Body>' requested here
                if (!qi::parse(it, fin, bgram, tree2)) {
                         ^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:273:14: note: candidate function [with Context = boost::spirit::context<boost::fusion::cons<ast::Body &,
      boost::fusion::nil>, boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, Skipper =
      boost::spirit::unused_type, Attribute = ast::Body] not viable: no known conversion from '__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >' to
      'boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *,
      std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > &' for 1st argument
        bool parse(Iterator& first, Iterator const& last
             ^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:319:14: note: candidate function template not viable: requires 6 arguments, but 5 were provided
        bool parse(Iterator& first, Iterator const& last
             ^
1 error generated.

The "2" option gives essentially the same error message. gcc doesn't seem to give a better error message.

Here's the complete source code:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>

#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/variant/get.hpp>
#include <boost/variant/variant.hpp>
#include <boost/variant/recursive_variant.hpp>
#include <boost/preprocessor/stringize.hpp>

#include <iostream>
#include <string>
#include <vector>

typedef unsigned int uint;

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace mpl = boost::mpl;

// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of
// our lex:: api calls
#define WHICH_LEXER_TYPE 0

//// Lexer definition

enum tokenids {
  LCARET = lex::min_token_id + 10,
  RCARET,
  BSLASH,
  LBRACE,
  RBRACE,
  LPAREN,
  RPAREN,
  EQUALS,
  USCORE,
  ALPHA,
  NUM,
  EOL,
  BLANK,
  IDANY
};

#define TOKEN_CASE(X)                                                          \
  case X: return #X

const char *token_id_string(size_t id) {
  switch (id) {
    TOKEN_CASE(LCARET);
    TOKEN_CASE(RCARET);
    TOKEN_CASE(BSLASH);
    TOKEN_CASE(LBRACE);
    TOKEN_CASE(RBRACE);
    TOKEN_CASE(LPAREN);
    TOKEN_CASE(RPAREN);
    TOKEN_CASE(EQUALS);
    TOKEN_CASE(USCORE);
    TOKEN_CASE(ALPHA);
    TOKEN_CASE(NUM);
    TOKEN_CASE(EOL);
    TOKEN_CASE(BLANK);
    TOKEN_CASE(IDANY);
  default:
    return "Unknown token";
  }
}

template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
  lex_basic() {
    this->self.add
        ('<', LCARET)
        ('>', RCARET)
        ('/', BSLASH)
        ('{', LBRACE)
        ('}', RBRACE)
        ('(', LPAREN)
        (')', RPAREN)
        ('=', EQUALS)
        ('_', USCORE)
        ("[A-Za-z]", ALPHA)
        ("[0-9]", NUM)
        ('\n', EOL)
        ("[ \\t\\r]", BLANK)
        (".", IDANY);
  }
};

typedef std::string::const_iterator str_it;
// the token type needs to know the iterator type of the underlying
// input and the set of used token value types
typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;

template <typename TokenType> struct token_buffer {
  std::vector<TokenType> tokens_;

  token_buffer() = default;

  bool operator()(const TokenType &t) {
    tokens_.push_back(t);
    return true;
  }

  void print(std::ostream &o) const {
    o << "tokens_.size() == " << tokens_.size() << std::endl;
    for (size_t i = 0; i < tokens_.size(); ++i) {
      const TokenType &t = tokens_[i];

      o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
        << "\" [";

      const auto &v = t.value();
      if (t.id() == EOL) {
        o << "\\n";
      } else {
        o << v;
      }
      o << "]" << std::endl;
    }
  }
};

/***
 * Lexers which serve tokens from a buffer
 */

// Two versions of the same thing, one deriving from lex::lexer, one not
template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
public:
  typedef std::vector<token_type> buff_type;
  typedef typename buff_type::const_iterator iterator_type;

private:
  const buff_type &buff_;

public:
  buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}

  iterator_type begin() const { return buff_.begin(); }
  iterator_type end() const { return buff_.end(); }

  // for consistency with regular lexer `begin` signature, not sure if this is
  // needed
  template <typename T> iterator_type begin(T, T) { return begin(); }
};

template <typename Iterator, typename TokenType,
          typename Functor = lex::lexertl::functor<
          TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
  typedef TokenType token_type;
  typedef std::vector<token_type> buff_type;
  typedef typename buff_type::const_iterator iterator_type;

  typedef typename boost::detail::iterator_traits<
      typename token_type::iterator_type>::value_type char_type;

private:
  buff_type buff_;

public:
  buffer_lexer_raw() {}

  void set_buffer(const buff_type &b) { buff_ = b; }

  iterator_type begin() const { return buff_.begin(); }
  iterator_type end() const { return buff_.end(); }

  // for consistency with regular lexer `begin` signature, not sure if this is
  // needed
  template <typename T> iterator_type begin(T, T) { return begin(); }

  std::size_t add_token(char_type const *state, char_type tokendef,
        std::size_t token_id, char_type const *targetstate) {
    return 1;
  }

  void clear(char_type const *state) {}
};

/***
 * AST
 */

namespace ast {
typedef std::string Str;

struct BraceExpr;

typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;

struct BraceExpr {
  std::vector<BraceExprArg> args;
};

typedef std::pair<Str, Str> Pair;

struct Body;

typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;

struct Body {
  Str key;
  std::vector<Node> nodes;
};
} // end namespace ast

BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
          (std::vector<ast::BraceExprArg>, args))
BOOST_FUSION_ADAPT_STRUCT(ast::Body,
          (ast::Str, key)(std::vector<ast::Node>, nodes))

namespace ast {
// Stream ops
class printer : public boost::static_visitor<> {
  std::ostream &ss_;
  uint indent_;
  std::string indent(uint extra = 0) const {
    return std::string(indent_ + extra, ' ');
  }
  std::string indent_plus_tab() const { return indent(tab_width); }

public:
  static constexpr uint tab_width = 4;

  explicit printer(std::ostream &s, uint indent = 0)
      : ss_(s), indent_(indent) {}

  void operator()(const Str &s) const { ss_ << s; }
  void operator()(const BraceExpr &b) const {
    ss_ << "{";
    for (size_t i = 0; i < b.args.size(); ++i) {
      if (i) {
        ss_ << " ";
      }
      boost::apply_visitor(*this, b.args[i]);
    }
    ss_ << "}";
  }
  void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }

  void operator()(const Body &b) const {
    ss_ << indent() << "<" << b.key << ">\n";
    printer p{ss_, indent_ + tab_width};
    for (const auto &n : b.nodes) {
      ss_ << indent_plus_tab();
      boost::apply_visitor(p, n);
      ss_ << "\n";
    }
    ss_ << indent() << "</" << b.key << ">";
  }
};

std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
  printer p{ss};
  p(b);
  return ss;
}

std::ostream &operator<<(std::ostream &ss, const Pair &p) {
  printer pr{ss};
  pr(p);
  return ss;
}

std::ostream &operator<<(std::ostream &ss, const Body &b) {
  printer p{ss};
  p(b);
  return ss;
}

// Equality ops
bool operator==(const Pair &p1, const Pair &p2) {
  return p1.first == p2.first && p1.second == p2.second;
}
bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
  return b1.args == b2.args;
}
bool operator==(const Body &b1, const Body &b2) {
  return b1.key == b2.key && b1.nodes == b2.nodes;
}
bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
  return !(b1 == b2);
}
bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
} // end namespace ast

/***
 * Grammar
 */

template <typename Iterator>
struct basic_grammar
    : qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
  qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
  qi::rule<Iterator, ast::Node()> node;
  qi::rule<Iterator, ast::Pair()> pair;
  qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
  qi::rule<Iterator, ast::BraceExpr()> brace_expr;
  qi::rule<Iterator, ast::Str()> identifier;
  qi::rule<Iterator, ast::Str()> str;
  qi::rule<Iterator, ast::Str()> open_tag;
  qi::rule<Iterator /*, ast::Str()*/> close_tag;
  qi::rule<Iterator> lbrace;
  qi::rule<Iterator> rbrace;
  qi::rule<Iterator> equals;

  qi::rule<Iterator> ws;

  template <typename TokenDef>
  basic_grammar(const TokenDef &tok)
      : basic_grammar::base_type(body, "body") {
    using namespace qi;

    ws %= token(BLANK) | token(EOL);
    lbrace %= token(LBRACE);
    rbrace %= token(RBRACE);
    equals %= token(EQUALS);
    identifier %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
    str %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
         token(RPAREN) | token(ALPHA) | token(NUM) | token(USCORE) |
         token(EQUALS) | token(BLANK) | token(IDANY));
    open_tag %= omit[token(LCARET)] >> identifier >>
        omit[token(RCARET)]; // tok.open_tag;
    close_tag %= omit[token(LCARET) >> token(BSLASH)] >> identifier >>
         omit[token(RCARET)]; // tok.close_tag;

    pair = skip(boost::proto::deep_copy(ws))[identifier >> equals >> str];

    body = skip(boost::proto::deep_copy(ws))[open_tag >> *node >> close_tag];
    node = brace_expr | body | pair;

    brace_expr_arg = brace_expr | identifier;
    brace_expr =
        skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
  }
};

/***
 * Usage / Tests
 */

// use actor_lexer<> here if your token definitions have semantic
// actions
typedef lex::lexertl::lexer<token_type> lexer_type;

// this is the iterator exposed by the lexer, we use this for parsing
typedef lexer_type::iterator_type iterator_type;

token_buffer<token_type> test_lexer(const std::string &input,
        bool silent = false) {
  str_it s = input.begin();
  str_it end = input.end();

  // create a lexer instance
  lex_basic<lexer_type> lex;

  token_buffer<token_type> buff;
  if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
    if (!silent) {
      std::cout << "\nTokenizing failed!" << std::endl;
    }
  } else {
    if (!silent) {
      std::cout << "\nTokenizing succeeded!" << std::endl;
    }
  }

  if (!silent) {
    buff.print(std::cout);
  }
  return buff;
}

void test_grammar(const std::string &input) {
  lex_basic<lexer_type> lex;
  basic_grammar<iterator_type> gram{lex};
  ast::Body tree;

  {
    str_it s = input.begin();
    str_it end = input.end();

    if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
      std::cout << "\nParsing failed!" << std::endl;
    } else {
      std::cout << "\nParsing succeeded!" << std::endl;
    }

    std::cout << tree << std::endl;
  }

  // Now try to do it in two steps, with buffered lexer
  auto buff = test_lexer(input, true); // get buffer, silence output

#if WHICH_LEXER_TYPE == 1
  buffer_lexer<lexer_type> blex{buff.tokens_};
#else
#if WHICH_LEXER_TYPE == 2
  buffer_lexer_raw<str_it, token_type> blex;
  blex.set_buffer(buff.tokens_);
#else
  lex_basic<lexer_type> blex;
#endif
#endif

  basic_grammar<iterator_type> bgram{blex};
  ast::Body tree2;

  {
#if (WHICH_LEXER_TYPE == 1) || (WHICH_LEXER_TYPE == 2)
    auto it = blex.begin();
#else
    str_it s = input.begin();
    str_it end = input.end();
    auto it = blex.begin(s, end);
#endif

    auto fin = blex.end();

    if (!qi::parse(it, fin, bgram, tree2)) {
      std::cout << "\nBuffered parsing failed!" << std::endl;
    } else {
      std::cout << "\nBuffered parsing succeeded!" << std::endl;
    }
  }

  std::cout << tree2 << std::endl;

  if (tree != tree2) {
    std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
          << std::endl;
  }
}

int main() {
  std::string input{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"</asdf>\n"};

  test_lexer(input);

  // Use lexer and grammar at once as demonstrated in tutorials

  std::string input2 = "<asdf></asdf>";
  test_grammar(input2);

  test_grammar(input);

  std::string input3{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"<jkl>\n"
"baz = gaz\n"
"{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
"</jkl>\n"
"</asdf>\n"};

  test_grammar(input3);

  return 0;
}

1 Answer


I also thought multi_pass was to blame, but after a lot of fiddling I was able to get it working with two simple fixes.¹

template <typename Iterator, typename TokenType,
          typename Functor = lex::lexertl::functor<
          TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
    typedef TokenType token_type;
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator vec_iterator_type;
  public:

    struct iterator_type : vec_iterator_type {
        typedef vec_iterator_type base_iterator_type;
        using vec_iterator_type::vec_iterator_type;
    };

    typedef char char_type;

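This ensures that the nested iterator_type has its own base_iterator_type typedef. That seems to be required somewhere inside the library (possibly because of assumptions it makes about token iterators).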

The second part is where the grammar is actually instantiated: don't use the "plain" iterator, but the one we just defined:

basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};

Complete working listing:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>

#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/variant/get.hpp>
#include <boost/variant/variant.hpp>
#include <boost/variant/recursive_variant.hpp>
#include <boost/preprocessor/stringize.hpp>

#include <iostream>
#include <string>
#include <vector>

typedef unsigned int uint;

namespace lex = boost::spirit::lex;
namespace qi  = boost::spirit::qi;
namespace mpl = boost::mpl;

//// Lexer definition

enum tokenids {
  LCARET = lex::min_token_id + 10,
  RCARET,
  BSLASH,
  LBRACE,
  RBRACE,
  LPAREN,
  RPAREN,
  EQUALS,
  USCORE,
  ALPHA,
  NUM,
  EOL,
  BLANK,
  IDANY
};

#define TOKEN_CASE(X)                                                          \
  case X: return #X

const char *token_id_string(size_t id) {
  switch (id) {
    TOKEN_CASE(LCARET);
    TOKEN_CASE(RCARET);
    TOKEN_CASE(BSLASH);
    TOKEN_CASE(LBRACE);
    TOKEN_CASE(RBRACE);
    TOKEN_CASE(LPAREN);
    TOKEN_CASE(RPAREN);
    TOKEN_CASE(EQUALS);
    TOKEN_CASE(USCORE);
    TOKEN_CASE(ALPHA);
    TOKEN_CASE(NUM);
    TOKEN_CASE(EOL);
    TOKEN_CASE(BLANK);
    TOKEN_CASE(IDANY);
  default:
    return "Unknown token";
  }
}

template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
  lex_basic() {
    this->self.add
        ('<', LCARET)
        ('>', RCARET)
        ('/', BSLASH)
        ('{', LBRACE)
        ('}', RBRACE)
        ('(', LPAREN)
        (')', RPAREN)
        ('=', EQUALS)
        ('_', USCORE)
        ("[A-Za-z]", ALPHA)
        ("[0-9]", NUM)
        ('\n', EOL)
        ("[ \\t\\r]", BLANK)
        (".", IDANY);
  }
};

typedef std::string::const_iterator str_it;
// the token type needs to know the iterator type of the underlying
// input and the set of used token value types
typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;

template <typename TokenType> struct token_buffer {
  std::vector<TokenType> tokens_;

  token_buffer() = default;

  bool operator()(const TokenType &t) {
    tokens_.push_back(t);
    return true;
  }

  void print(std::ostream &o) const {
    o << "tokens_.size() == " << tokens_.size() << std::endl;
    for (size_t i = 0; i < tokens_.size(); ++i) {
      const TokenType &t = tokens_[i];

      o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
        << "\" [";

      const auto &v = t.value();
      if (t.id() == EOL) {
        o << "\\n";
      } else {
        o << v;
      }
      o << "]" << std::endl;
    }
  }
};

/***
 * Lexers which serve tokens from a buffer
 */

// Two versions of the same thing, one deriving from lex::lexer, one not
template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
public:
  typedef std::vector<token_type> buff_type;
  typedef typename buff_type::const_iterator iterator_type;

private:
  const buff_type &buff_;

public:
  buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}

  iterator_type begin() const { return buff_.begin(); }
  iterator_type end()   const { return buff_.end(); }

  // for consistency with regular lexer `begin` signature, not sure if this is
  // needed
  template <typename T> iterator_type begin(T, T) { return begin(); }
};

template <typename Iterator, typename TokenType,
          typename Functor = lex::lexertl::functor<
          TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
    typedef TokenType token_type;
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator vec_iterator_type;
  public:

    struct iterator_type : vec_iterator_type {
        typedef vec_iterator_type base_iterator_type;
        using vec_iterator_type::vec_iterator_type;
    };

    typedef char char_type;

private:
    buff_type buff_;

public:
    buffer_lexer_raw() {}

    void set_buffer(const buff_type &b) { buff_ = b; }

    iterator_type begin() const { return buff_.begin(); } 
    iterator_type end()   const { return buff_.end();   } 

    // for consistency with regular lexer `begin` signature, not sure if this is
    // needed
    template <typename T> iterator_type begin(T, T) { return begin(); }

    std::size_t add_token(char_type const*, char_type, std::size_t, char_type const*) {
        return 1;
    }

    void clear(char_type const *) {}
};

/***
 * AST
 */

namespace ast {
    typedef std::string Str;

    struct BraceExpr;

    typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;

    struct BraceExpr {
        std::vector<BraceExprArg> args;
    };

    typedef std::pair<Str, Str> Pair;

    struct Body;

    typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;

    struct Body {
        Str key;
        std::vector<Node> nodes;
    };
} // end namespace ast

BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
          (std::vector<ast::BraceExprArg>, args))
BOOST_FUSION_ADAPT_STRUCT(ast::Body,
          (ast::Str, key)(std::vector<ast::Node>, nodes))

namespace ast {
    // Stream ops
    class printer : public boost::static_visitor<> {
        std::ostream &ss_;

        unsigned indent_;
        std::string indent(unsigned extra = 0) const { return std::string(indent_ + extra, ' '); }
        std::string indent_plus_tab() const { return indent(tab_width); }

      public:
        static constexpr unsigned tab_width = 4;

        explicit printer(std::ostream &s, unsigned indent = 0)
            : ss_(s), indent_(indent) {}

        void operator()(const Str &s) const { ss_ << s; }
        void operator()(const BraceExpr &b) const {
            ss_ << "{";
            for (size_t i = 0; i < b.args.size(); ++i) {
                if (i) {
                    ss_ << " ";
                }
                boost::apply_visitor(*this, b.args[i]);
            }
            ss_ << "}";
        }
        void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }

        void operator()(const Body &b) const {
            ss_ << indent() << "<" << b.key << ">\n";
            printer p{ss_, indent_ + tab_width};
            for (const auto &n : b.nodes) {
                ss_ << indent_plus_tab();
                boost::apply_visitor(p, n);
                ss_ << "\n";
            }
            ss_ << indent() << "</" << b.key << ">";
        }
    };

    std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
        printer p{ss};
        p(b);
        return ss;
    }

    std::ostream &operator<<(std::ostream &ss, const Pair &p) {
        printer pr{ss};
        pr(p);
        return ss;
    }

    std::ostream &operator<<(std::ostream &ss, const Body &b) {
        printer p{ss};
        p(b);
        return ss;
    }

    // Equality ops
    bool operator==(const Pair &p1, const Pair &p2) {
        return p1.first == p2.first && p1.second == p2.second;
    }
    bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
        return b1.args == b2.args;
    }
    bool operator==(const Body &b1, const Body &b2) {
        return b1.key == b2.key && b1.nodes == b2.nodes;
    }
    bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
    bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
        return !(b1 == b2);
    }
    bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
} // end namespace ast

/***
 * Grammar
 */

template <typename Iterator>
struct basic_grammar : qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
    qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
    qi::rule<Iterator, ast::Node()>         node;
    qi::rule<Iterator, ast::Pair()>         pair;
    qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
    qi::rule<Iterator, ast::BraceExpr()>    brace_expr;
    qi::rule<Iterator, ast::Str()>          identifier;
    qi::rule<Iterator, ast::Str()>          str;
    qi::rule<Iterator, ast::Str()>          open_tag;
    qi::rule<Iterator  /*, ast::Str()*/>    close_tag;
    qi::rule<Iterator> lbrace;
    qi::rule<Iterator> rbrace;
    qi::rule<Iterator> equals;

    qi::rule<Iterator> ws;

    template <typename TokenDef>
    basic_grammar(const TokenDef &tok) : basic_grammar::base_type(body, "body") {
        using namespace qi;

        ws            %= token(BLANK) | token(EOL);
        lbrace        %= token(LBRACE);
        rbrace        %= token(RBRACE);
        equals        %= token(EQUALS);
        identifier    %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
        str           %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
                token(RPAREN) | token(ALPHA)  | token(NUM)    | token(USCORE) |
                token(EQUALS) | token(BLANK)  | token(IDANY));

        open_tag      %= omit[token(LCARET)]                   >> identifier >> omit[token(RCARET)]; // tok.open_tag;
        close_tag     %= omit[token(LCARET)  >> token(BSLASH)] >> identifier >> omit[token(RCARET)]; // tok.close_tag;

        // TODO FIXME the deep_copy should not be required here
        /// bla_12 = somevalue

        pair           = skip(boost::proto::deep_copy(ws)) [ identifier >> equals >> str    ] ;

        /// <bla><sub>{some}{braced{expres}}sions</sub><pair1>key1=value</pair1></bla>
        body           = skip(boost::proto::deep_copy(ws)) [ open_tag >> *node >> close_tag ] ;
        /// 
        node           = brace_expr | body | pair;

        brace_expr_arg = brace_expr | identifier;

        /// {{{bla}some{other}nested{id{entifier}s}}and such}
        brace_expr     = skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
    }
};

/***
 * Usage / Tests
 */

// use actor_lexer<> here if your token definitions have semantic
// actions
typedef lex::lexertl::lexer<token_type> lexer_type;

// this is the iterator exposed by the lexer, we use this for parsing
typedef lexer_type::iterator_type iterator_type;

token_buffer<token_type> test_lexer(const std::string &input, 
        bool silent = false) {
    str_it s   = input.begin();
    str_it end = input.end();

    // create a lexer instance
    lex_basic<lexer_type> lex;

    token_buffer<token_type> buff;
    if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
        if (!silent) {
            std::cout << "\nTokenizing failed!" << std::endl;
        }
    } else {
        if (!silent) {
            std::cout << "\nTokenizing succeeded!" << std::endl;
        }
    }

    if (!silent) {
        buff.print(std::cout);
    }
    return buff;
}

void test_grammar(const std::string &input) {
    lex_basic<lexer_type> lex;
    basic_grammar<iterator_type> gram{lex};
    ast::Body tree;

    {
        str_it s = input.begin();
        str_it end = input.end();

        if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
            std::cout << "\nParsing failed!" << std::endl;
        } else {
            std::cout << "\nParsing succeeded!" << std::endl;
        }

        std::cout << tree << std::endl;
    }

    // Now try to do it in two steps, with buffered lexer
    auto buff = test_lexer(input, true); // get buffer, silence output

    typedef buffer_lexer_raw<str_it, token_type> concrete_lexer_type;

    concrete_lexer_type blex;
    blex.set_buffer(buff.tokens_);

    basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};
    ast::Body tree2;

    {
        auto it = blex.begin();
        auto fin = blex.end();

        if (!qi::parse(it, fin, bgram, tree2)) {
            std::cout << "\nBuffered parsing failed!" << std::endl;
        } else {
            std::cout << "\nBuffered parsing succeeded!" << std::endl;
        }
    }

    std::cout << tree2 << std::endl;

    if (tree != tree2) {
        std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
            << std::endl;
    }
}

int main() {
    std::string const input{""
        "<asdf>\n"
        "foo = bar\n"
        "{F foo}\n"
        "{G {F foo} {H bar}}\n"
        "</asdf>\n"};

    test_lexer(input);

    // Use lexer and grammar at once as demonstrated in tutorials

    std::string const input2 = "<asdf></asdf>";
    test_grammar(input2);

    test_grammar(input);

    std::string const input3{""
        "<asdf>\n"
        "foo = bar\n"
        "{F foo}\n"
        "{G {F foo} {H bar}}\n"
        "<jkl>\n"
        "baz = gaz\n"
        "{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
        "</jkl>\n"
        "</asdf>\n"};

    test_grammar(input3);
}

Prints:

Tokenizing succeeded!
tokens_.size() == 53
[0]: -LCARET- "65546" [<]
[1]: -ALPHA- "65555" [a]
[2]: -ALPHA- "65555" [s]
[3]: -ALPHA- "65555" [d]
[4]: -ALPHA- "65555" [f]
[5]: -RCARET- "65547" [>]
[6]: -EOL- "65557" [\n]
[7]: -ALPHA- "65555" [f]
[8]: -ALPHA- "65555" [o]
[9]: -ALPHA- "65555" [o]
[10]: -BLANK- "65558" [ ]
[11]: -EQUALS- "65553" [=]
[12]: -BLANK- "65558" [ ]
[13]: -ALPHA- "65555" [b]
[14]: -ALPHA- "65555" [a]
[15]: -ALPHA- "65555" [r]
[16]: -EOL- "65557" [\n]
[17]: -LBRACE- "65549" [{]
[18]: -ALPHA- "65555" [F]
[19]: -BLANK- "65558" [ ]
[20]: -ALPHA- "65555" [f]
[21]: -ALPHA- "65555" [o]
[22]: -ALPHA- "65555" [o]
[23]: -RBRACE- "65550" [}]
[24]: -EOL- "65557" [\n]
[25]: -LBRACE- "65549" [{]
[26]: -ALPHA- "65555" [G]
[27]: -BLANK- "65558" [ ]
[28]: -LBRACE- "65549" [{]
[29]: -ALPHA- "65555" [F]
[30]: -BLANK- "65558" [ ]
[31]: -ALPHA- "65555" [f]
[32]: -ALPHA- "65555" [o]
[33]: -ALPHA- "65555" [o]
[34]: -RBRACE- "65550" [}]
[35]: -BLANK- "65558" [ ]
[36]: -LBRACE- "65549" [{]
[37]: -ALPHA- "65555" [H]
[38]: -BLANK- "65558" [ ]
[39]: -ALPHA- "65555" [b]
[40]: -ALPHA- "65555" [a]
[41]: -ALPHA- "65555" [r]
[42]: -RBRACE- "65550" [}]
[43]: -RBRACE- "65550" [}]
[44]: -EOL- "65557" [\n]
[45]: -LCARET- "65546" [<]
[46]: -BSLASH- "65548" [/]
[47]: -ALPHA- "65555" [a]
[48]: -ALPHA- "65555" [s]
[49]: -ALPHA- "65555" [d]
[50]: -ALPHA- "65555" [f]
[51]: -RCARET- "65547" [>]
[52]: -EOL- "65557" [\n]

Parsing succeeded!
<asdf>
</asdf>

Buffered parsing succeeded!
<asdf>
</asdf>

Parsing succeeded!
<asdf>
    foo = bar
    {F foo}
    {G {F foo} {H bar}}
</asdf>

Buffered parsing succeeded!
<asdf>
    foo = bar
    {F foo}
    {G {F foo} {H bar}}
</asdf>

Parsing succeeded!
<asdf>
    foo = bar
    {F foo}
    {G {F foo} {H bar}}
        <jkl>
        baz = gaz
        {H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
    </jkl>
</asdf>

Buffered parsing succeeded!
<asdf>
    foo = bar
    {F foo}
    {G {F foo} {H bar}}
        <jkl>
        baz = gaz
        {H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
    </jkl>
</asdf>

¹ Based on the buffer_lexer_raw approach

Answered 2015-08-25T01:37:55.427