c++ - How to make a Boost.Spirit.Lex token value a substring of the matched sequence (preferably via a regex capture group)

Question

I am writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar that operates on Boost.Spirit.Lex tokens (Boost version 1.56).

The tokens are defined as follows:

using namespace boost::spirit;

template<
    typename lexer_t
>
struct tokens
    : lex::lexer<lexer_t>
{
    tokens()
        : /* ... */,
          variable("%(\\w+)")
    {
        this->self =
            /* ... */ |
            variable;
    }

    /* ... */
    lex::token_def<std::string> variable;
};

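Now I would like the value of the variable token to be just the name (the (\\w+) capture group), without the % prefix. How do I do that?

Using a capture group alone does not help. The value is still the full string, including the % prefix.

Is there a way to force the use of the capture group?

Or at least to refer to it somehow from within the token's action?

I also tried using an action like this: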

variable[lex::_val = std::string(lex::_start + 1, lex::_end)]

But it does not compile. The error claims that no std::string constructor overload matches the arguments:

(const boost::phoenix::actor<Expr>, const boost::spirit::lex::_end_type)

The even simpler

variable[lex::_val = std::string(lex::_start, lex::_end)]

fails to compile as well, for a similar reason; only now the first argument type is boost::spirit::lex::_start_type.

Finally I tried this (even though it looks like a big waste):

lex::_val = std::string(lex::_val).erase(0, 1)

But this does not compile either. This time the compiler cannot convert from const boost::spirit::lex::_val_type to std::string.

Is there any way to solve this?

score 1 · Accepted Answer

Simple solution

The correct way to construct the std::string attribute value is as follows:

variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]

Exactly as jv_ suggested in his (or her) comment.

boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
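The reason the plain std::string(lex::_start + 1, lex::_end) cannot work is that the placeholders are not iterators; they only stand in for iterators that exist once a token has actually been matched, so the construction has to be deferred. The sketch below is a standard-library-only analogy and not Boost.Phoenix itself: lazy_construct and drop_first are hypothetical names, and a lambda plays the role of the deferred expression.

```cpp
#include <string>

// Hypothetical stand-in for a deferred constructor call: returns a callable
// that builds a T later, once the real iterators are available.
template <typename T>
auto lazy_construct()
{
    return [](auto first, auto last) { return T(first, last); };
}

// Hypothetical helper mirroring the shape of construct<std::string>(_start + 1, _end):
// the "+ 1" is evaluated only when the callable is invoked with real iterators.
// Precondition: the range is not empty.
inline auto drop_first()
{
    return [](std::string::const_iterator first, std::string::const_iterator last) {
        return lazy_construct<std::string>()(first + 1, last);
    };
}
```

Calling drop_first() creates nothing but a callable; the string is built only when that callable receives the begin and end of a matched token.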

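Regex solution

However, the above solution is suitable only for simple cases. It also rules out supplying the pattern from outside (from configuration data in particular), because changing the pattern to, for example, %(\\w+)% would require changing the value construction code.

That is why it would be much better to be able to refer to capture groups of the regular expression that defines the token.

Note that this is still not perfect, because odd cases such as %(\\w+)%(\\w+)% would still require code changes to be handled properly. That could be worked around by configuring not only the regular expression for a token but also the means of forming the value from the matched range. However, this is beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.

sehe stated in a comment elsewhere that there is no way to use capture groups from the token's regular expression. Not to mention that tokens actually support only a subset of regular expressions. (Among the notable differences is, for example, the lack of support for naming capture groups or for ignoring them!)

My own experiments in this area support that as well. Sadly, there is no way to use the capture groups. There is a workaround, though: you simply re-apply the regular expression in your action.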
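The core of the workaround can be sketched outside of Spirit. The example below is an illustration only: it uses std::regex rather than Boost.Regex, and capture_of is a hypothetical name. It re-applies the token's pattern to the text that was already matched and returns one capture group.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Re-apply the token's pattern to the text the lexer already matched and
// return the requested capture group. An empty string is returned when the
// text does not match or the group index is out of range.
inline std::string capture_of(std::string const& matched,
                              std::string const& pattern,
                              std::size_t capture_index = 1)
{
    std::regex const re(pattern);
    std::smatch results;
    if (!std::regex_match(matched, results, re) || capture_index >= results.size())
        return std::string();
    return results[capture_index].str();
}
```

With this approach the same code handles both "%(\\w+)" and "%(\\w+)%", which is the property the offset-based construction lacks. The price is that the pattern is compiled and matched a second time.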

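Action to get the capture range

To keep things somewhat modular, let's start with the simplest task: an action that returns a boost::iterator_range covering the part of the token's match that corresponds to the specified capture.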

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture
{
public:
    typedef lex::token_def<Attribute, Char, Idtype> token_type;
    typedef boost::basic_regex<Char> regex_type;

    explicit basic_get_capture(token_type const& token, int capture_index = 1)
        : token(token),
          regex(),
          capture_index(capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
    {
        typedef boost::match_results<Iterator> match_results_type;

        match_results_type results;
        regex_match(first, last, results, get_regex());
        typename match_results_type::const_reference capture = results[capture_index];
        return boost::iterator_range<Iterator>(capture.first, capture.second);
    }

private:
    regex_type& get_regex()
    {
        if(regex.empty())
        {
            typename token_type::string_type const& regex_text = token.definition();
            regex.assign(regex_text);
        }
        return regex;
    }

    token_type const& token;
    regex_type regex;
    int capture_index;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture<Attribute, Char, Idtype> get_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture<Attribute, Char, Idtype>(token, capture_index);
}

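The action uses Boost.Regex (include <boost/regex.hpp>).

Action to get the capture as a string

A capture range is nice because it does not allocate any new memory for a string, but in the end a string is what we want. So here is another action built on top of the previous one.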

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture_as_string
{
public:
    typedef basic_get_capture<Attribute, Char, Idtype> basic_get_capture_type;
    typedef typename basic_get_capture_type::token_type token_type;

    explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
        : get_capture_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
        return std::basic_string<Char>(capture.begin(), capture.end());
    }

private:
    basic_get_capture_type get_capture_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture_as_string<Attribute, Char, Idtype> get_capture_as_string(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture_as_string<Attribute, Char, Idtype>(token, capture_index);
}

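No magic here. We just create a std::basic_string from the range returned by the simpler action.

Action to assign the value from the capture

Actions that merely return a value are of little use to us. The ultimate goal is to set the token value from the capture. That is done by the last action.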

template<typename Attribute, typename Char, typename Idtype>
class basic_set_val_from_capture
{
public:
    typedef basic_get_capture_as_string<Attribute, Char, Idtype> basic_get_capture_as_string_type;
    typedef typename basic_get_capture_as_string_type::token_type token_type;

    explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
        : get_capture_as_string_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
        context.set_value(capture);
    }

private:
    basic_get_capture_as_string_type get_capture_as_string_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_set_val_from_capture<Attribute, Char, Idtype> set_val_from_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_set_val_from_capture<Attribute, Char, Idtype>(token, capture_index);
}

Discussion

The actions are used like this:

variable[set_val_from_capture(variable)]

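Optionally, you can provide a second argument with the capture index to use. It defaults to 1, which seems appropriate in most cases.

Creation functions

set_val_from_capture (or get_capture_as_string or get_capture, respectively) is a helper function used to deduce the template arguments automatically from the token_def. In particular, what we need is the Char type, to build the corresponding regular expression.

I am not sure whether this could reasonably be avoided, and even if it could, it would significantly complicate the call operator (especially if we want to cache the regex object instead of rebuilding it every time). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assume they do not have to be the same.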
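The two concerns mentioned here, a regex type that depends on Char and compiling the pattern only once, can be shown in isolation. The following is a minimal sketch using std::basic_regex instead of boost::basic_regex; cached_regex and compile_count are hypothetical names added for illustration. Since std::basic_regex has no empty() member, a flag records whether the pattern has been compiled yet.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical holder: the regex type follows the Char parameter, and the
// pattern is compiled on first use and reused afterwards.
template <typename Char>
class cached_regex
{
public:
    explicit cached_regex(std::basic_string<Char> pattern)
        : pattern_(std::move(pattern)), compiled_(false), compile_count_(0)
    {
    }

    std::basic_regex<Char> const& get()
    {
        if (!compiled_)
        {
            regex_.assign(pattern_);  // compile once, lazily
            compiled_ = true;
            ++compile_count_;
        }
        return regex_;
    }

    // Exposed only so the caching behaviour can be observed.
    int compile_count() const { return compile_count_; }

private:
    std::basic_string<Char> pattern_;
    std::basic_regex<Char> regex_;
    bool compiled_;
    int compile_count_;
};
```

The same template serves cached_regex<char> and cached_regex<wchar_t>, which is why the Char type has to be known at the point where the action object is created.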

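Repeating the token

A definitely unpleasant part of the actions is the need to supply the token itself as an argument, which repeats it.

However, the token is needed for the Char type, as mentioned above, and... to get the regular expression!

It seems to me that, at least in theory, we could somehow obtain the token "at run time" based on the id argument of the action (which we currently just ignore). However, I failed to find any way to obtain a token_def from a token identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creation function).

Reusability

Since these are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to some numeric value, you would have to write another action in the same manner, rather than composing a more complex expression on the token.

At first I tried to achieve something like this: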

variable[lex::_val = get_capture_as_string(variable)]

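It looks more flexible, since you could easily add more code around it, for example by wrapping it in some conversion function.

But I failed to achieve it, although I feel I did not try hard enough. Learning more about Boost.Phoenix would surely help a lot here.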
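The kind of composition meant here can be illustrated without Phoenix. The sketch below is not a Spirit.Lex action and does not solve the problem above; it only shows, with the standard library, how a value-returning capture getter could be wrapped in a conversion. capture_getter and transform_capture are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <regex>
#include <string>

// Hypothetical value-returning getter: given the matched text, returns the
// requested capture group (empty string if the text does not match).
inline std::function<std::string(std::string const&)>
capture_getter(std::string const& pattern, std::size_t index = 1)
{
    std::regex const re(pattern);
    return [re, index](std::string const& matched) {
        std::smatch m;
        if (std::regex_match(matched, m, re) && index < m.size())
            return m[index].str();
        return std::string();
    };
}

// Hypothetical combinator: feeds the getter's result into a conversion.
template <typename Getter, typename Convert>
auto transform_capture(Getter get, Convert convert)
{
    return [get, convert](std::string const& matched) {
        return convert(get(matched));
    };
}
```

For instance, transform_capture(capture_getter("#(\\d+)"), [](std::string const& s) { return std::stoi(s); }) yields a callable that turns "#42" into the integer 42, without writing a separate class for the numeric case.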

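Double work

None of these workarounds saves us from doing the work twice, both in parsing the regular expression and in matching it. But as mentioned at the beginning, there seems to be no better way (short of changing Boost.Spirit itself).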
