c++ - Using Boost.Spirit to extract certain tags/attributes from HTML

Question

So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.

One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.

This is my current regex:

(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)

I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.

score 2 · Accepted Answer

And of course, the Boost Spirit variant could not be left out:

sehe@natty:/tmp$ time ./spirit < bench > /dev/null

real    0m3.895s
user    0m3.820s
sys     0m0.070s

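In all honesty, the Spirit code is more versatile than the other variants:

- it parses attributes a bit more intelligently, so it would be easy to handle several kinds of attributes at once, possibly depending on the containing element
- the Spirit parser would be easier to adapt to cross-line matching. This could most easily be achieved by
  - using spirit::istream_iterator<> (which is, unfortunately, notoriously slow)
  - using a memory-mapped file with raw const char* as iterators; this latter approach works equally well for the other techniques

The code follows (full code at https://gist.github.com/c16725584493b021ba5b):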

//#define BOOST_SPIRIT_DEBUG
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

void handle_attr(
        const std::string& elem, 
        const std::string& attr, 
        const std::string& value)
{
    if (elem == "img" && attr == "src")
        std::cout << "value : " << value << std::endl;
}

typedef std::string::const_iterator It;
typedef qi::space_type Skipper;

struct grammar : qi::grammar<It, Skipper>
{
    grammar() : grammar::base_type(html)
    {
        using namespace boost::spirit::qi;
        using phx::bind;

        attr = as_string [ +~char_("= \t\r\n/>") ] [ _a = _1 ]
                >> '=' >> (
                    as_string [ '"' >> lexeme [ *~char_('"') ] >> '"' ]
                  | as_string [ "'" >> lexeme [ *~char_("'") ] >> "'" ]
                  ) [ bind(handle_attr, _r1, _a, _1) ]
            ;

        elem = lit('<') 
            >> as_string [ lexeme [ ~char_("-/>") >> *(char_ - space - char_("/>")) ] ] [ _a = _1 ]
            >> *attr(_a);

        html = (-elem) % +("</" | (char_ - '<'));

        BOOST_SPIRIT_DEBUG_NODE(html);
        BOOST_SPIRIT_DEBUG_NODE(elem);
        BOOST_SPIRIT_DEBUG_NODE(attr);
    }

    qi::rule<It, Skipper> html;
    qi::rule<It, Skipper, qi::locals<std::string> > elem;
    qi::rule<It, qi::unused_type(std::string), Skipper, qi::locals<std::string> > attr;
};

int main(int argc, const char *argv[])
{
    std::string s;

    const static grammar html_;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(),
           l = s.end();

        if (!phrase_parse(f, l, html_, qi::space) || (f!=l))
            std::cerr << "unparsed: " << std::string(f,l) << std::endl;
    }

    return 0;
}
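
The second option in the list above (raw const char* iterators over one contiguous buffer) is not tied to Spirit. Below is a minimal sketch of that idea under stated assumptions: it uses std::regex from C++11 rather than Spirit, a std::string read from a stream stands in for the memory-mapped file, and the names slurp and img_srcs are hypothetical, not taken from the gist.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Read the whole stream into one buffer so that a tag may span several lines.
// (Assumption: stands in for a memory-mapped file.)
std::string slurp(std::istream& in)
{
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Scan a [first, last) range of raw const char* pointers, the same kind of
// range a memory-mapped file would provide, and collect every img src value.
std::vector<std::string> img_srcs(const char* first, const char* last)
{
    // \s and [^>] both match newlines, so attributes may sit on separate lines.
    static const std::regex re("<img\\s+[^>]*?src\\s*=\\s*([\"'])(.*?)\\1");

    std::vector<std::string> out;
    for (std::cregex_iterator it(first, last, re), end; it != end; ++it)
        out.push_back((*it)[2].str());
    return out;
}
```

A caller would pass the buffer bounds, for example img_srcs(buf.data(), buf.data() + buf.size()) for a std::string buf filled by slurp(std::cin).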

score 1 · Accepted Answer

Out of curiosity, I redid my regex sample on top of Boost Xpressive, using statically compiled regexes:

sehe@natty:/tmp$ time ./expressive < bench > /dev/null

real    0m2.146s
user    0m2.110s
sys     0m0.030s

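Interestingly, there is no discernible speed difference when using the dynamic regular expression; on the whole, however, the Xpressive version performs better than the Boost Regex version (by roughly 10%).

What is really nice, IMO, is that switching from Boost Regex to Xpressive was almost just a matter of including xpressive.hpp and changing a few namespaces. The API (as far as it is used here) is exactly the same.

The relevant code follows (full code at https://gist.github.com/c16725584493b021ba5b):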

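#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>

// handle_attr is the same function as in the Spirit version
typedef std::string::const_iterator It;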

int main(int argc, const char *argv[])
{
    using namespace boost::xpressive;
#if DYNAMIC
    const sregex re = sregex::compile
         ("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
#else
    const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >> 
        "src" >> *_s >> '=' >> *_s
        >> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
#endif

    std::string s;
    smatch what;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(), l = s.end();

        do
        {
            if (!regex_search(f, l, what, re))
                break;

            handle_attr("img", "src", what[2]);
            f = what[0].second;
        } while (f!=s.end());
    }

    return 0;
}

score 1 · Accepted Answer

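Update

I did benchmarks.

Full disclosure is here: https://gist.github.com/c16725584493b021ba5b

It includes the full code used, the compilation flags and the body of test data (file bench) used.

In short

- Regular expressions are indeed faster and much simpler here
- Do not underestimate the time I spent debugging the Spirit grammar to get it correct!
- Care has been taken to eliminate 'accidental' differences, e.g. by
  - keeping handle_attr unchanged across the implementations, even though it mostly makes sense for the Spirit implementation only
  - using the same line-wise input style and string iterators for every implementation
- Right now, all three implementations produce exactly the same output
- Everything was built and timed with g++ 4.6.1 (C++03 mode), -O3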
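
The timings in these answers come from the shell's time command. As a rough in-process alternative, the sketch below times one scanning function over the same pre-read lines, so input handling stays identical between implementations. It is only an illustration under stated assumptions: it needs C++11 (std::chrono and std::function, unlike the C++03 build described above), and time_scan is a hypothetical helper that is not part of the gist.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Run `scan` once per input line and return the elapsed wall time in seconds.
// Every implementation under test receives the same lines, so only the
// scanning work differs between runs. (Hypothetical helper, not from the gist.)
double time_scan(const std::vector<std::string>& lines,
                 const std::function<void(const std::string&)>& scan)
{
    typedef std::chrono::steady_clock clock;

    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < lines.size(); ++i)
        scan(lines[i]);
    return std::chrono::duration<double>(clock::now() - start).count();
}
```

Each implementation can be wrapped in a callable taking one line; timing every wrapper over the same vector of lines gives comparable numbers.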

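Edit in reply to the knee-jerk (and correct) response that you shouldn't be parsing HTML with regexes:

- You shouldn't use regexes to parse non-trivial input (mainly, anything with a grammar). Of course, Perl 5.10+ 'regex grammars' are an exception, because they are no longer isolated regexes
- HTML basically cannot be parsed; it is non-standard tag soup. Strict (X)HTML is a different matter
- According to Xaade, if you don't have enough time to produce a perfect implementation using a standards-compliant HTML reader, you should

  "Ask the client whether they want shit or not. If they want shit, you charge them more. Shit costs you more than them." -- Xaade

That said, there are cases where I'd do exactly what I suggest here: use a regex. Mainly, if it is a one-off quick search, or to get daily, rough statistics on known data, etc. YMMV, and you should make your own call.

For timings and summaries, see:

- the Boost Regex answer below
- the Boost Xpressive answer here
- the Spirit answer here

I heartily suggest using a regex here: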

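#include <string>
#include <iostream>
#include <boost/regex.hpp>

// handle_attr(elem, attr, value) is shared by all three implementations
typedef std::string::const_iterator It;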

int main(int argc, const char *argv[])
{
    const boost::regex re("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");

    std::string s;
    boost::smatch what;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(), l = s.end();

        do
        {
            if (!boost::regex_search(f, l, what, re))
                break;

            handle_attr("img", "src", what[2]);
            f = what[0].second;
        } while (f!=s.end());
    }
    
    return 0;
}

Use it like this:

./test < index.htm
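
For readers without Boost, roughly the same search loop can be written with std::regex. This is a sketch under assumptions, not the benchmarked code: it needs C++11, the redundant backslash before > is dropped from the pattern, std::regex::icase is added to mirror the (?i: ...) in the question's regex, and extract_img_srcs is a hypothetical name. It makes no claim about speed relative to the Boost versions.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the src value of every <img ...> tag found in one chunk of text.
// Capture group 1 is the opening quote, group 2 the attribute value.
std::vector<std::string> extract_img_srcs(const std::string& text)
{
    static const std::regex re(
        "<img\\s+[^>]*?src\\s*=\\s*([\"'])(.*?)\\1",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> result;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        result.push_back((*it)[2].str());
    return result;
}
```

Calling it once per line read with std::getline and printing each returned string reproduces the line-wise style of the Boost.Regex version.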

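I can't really see why a Spirit-based approach should/could be any faster?

Edit PS. If you claim that static optimization would be the key, why not just convert it into a Boost Xpressive static regex?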
