c++ - Extracting a specific part of an HTML file using C++/boost::regex

Question

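I have a collection of thousands of HTML files and, with the end goal of running a word-frequency counter, I am only interested in a specific part of each file. For example, suppose the following is part of one of the files: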

<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
      <div class="textelement   "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->

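How should I use regular expressions in C++ (boost::regex) to extract the specific portion of text highlighted in the example and put it into a separate string?

I currently have some code that opens the HTML file and reads the entire contents into a single string, but when I try to run boost::regex_match to find the specific line beginning with <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they are in C++.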

score 1 · Accepted Answer

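How should I use regular expressions in C++ (boost::regex) to extract the specific portion of text highlighted in the example and put it into a separate string?

You don't.

Never use regular expressions to process HTML. Whether in C++ with Boost.Regex, or in Perl, Python, or JavaScript; anything and anywhere. HTML is not a regular language; therefore, it cannot be processed in any meaningful way by regular expressions. Oh, in extremely limited cases you might be able to get one to extract some particular piece of information. But once those cases change, you'll find yourself unable to get done what you need to get done.

I would suggest using an actual HTML parser, such as LibXML2 (which does have the ability to read HTML4). But using regexes to parse HTML is simply using the wrong tool for the job.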

score 1 · Accepted Answer

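Since all I needed was something very simple (as per the question above), I was able to get it done without using regexes or any kind of parsing. Here is the code snippet that worked: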

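    // Read the whole HTML file into the string variable str
    std::ifstream t("/path/inputFile.html");
    std::string str((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());

    // Find the two "flags" that enclose the content I'm trying to extract;
    // look for the closing flag only after the opening one
    size_t pos1 = str.find("<div class=\"preview_content clearfix module_panel\">");
    size_t pos2 = (pos1 == std::string::npos) ? pos1 : str.find("</em></p></div>", pos1);

    // Get that content (from the opening flag up to the closing one) and store it
    // into a new string; buf stays empty if either flag is missing
    std::string buf;
    if (pos2 != std::string::npos) buf = str.substr(pos1, pos2 - pos1);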
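
The same marker-based idea can be wrapped in a small reusable function. This is only a sketch: `between_markers` is a hypothetical name, and the choice to return just the text strictly between the two markers (excluding the markers themselves) is an assumption about what is wanted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the text strictly between the first occurrence
// of `open` and the next occurrence of `close` into `out`.
// Returns false (leaving `out` untouched) if either marker is not found.
bool between_markers(const std::string& text, const std::string& open,
                     const std::string& close, std::string& out) {
    const std::size_t start = text.find(open);
    if (start == std::string::npos) return false;

    // Content begins right after the opening marker.
    const std::size_t content = start + open.size();

    // Search for the closing marker only after the opening one.
    const std::size_t end = text.find(close, content);
    if (end == std::string::npos) return false;

    out = text.substr(content, end - content);
    return true;
}
```

With the two markers used in this answer, the result would still contain the nested `<div>`, `<p>` and `<em>` tags, so those would have to be stripped (or a tighter opening marker chosen) before counting words.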

Thanks for pointing out that I was going down completely the wrong path.
