c++ - Programmatically extracting tables from an HTML file using C/C++

Question

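I'm looking for better ideas for extracting tables from an HTML file. Right now I use tidy ( http://tidy.sourceforge.net/ ) to convert the HTML file to XHTML, and then I use rapidxml to parse the XML. While parsing, I look for <table>, <tr>, and <td> nodes and build my table data structure from them.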

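It works quite well, but I'm wondering whether there is a better way to accomplish this. Also, the tidy library appears to be an abandoned project.

Also, has anyone tried the "experimental" patches in the tidy source code?

Thanks, Christian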

score 0 · Accepted Answer

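I think your approach is fine. In my view the best way is exactly what you do: tidy the HTML, convert it to XHTML, and parse the XML. I don't see how it could be simplified.

You haven't mentioned any specific problem, so I'm not sure what the issue is.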
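
For reference, the node-walking step described in the question could look roughly like the sketch below. It is deliberately parser-independent: `Node` is a hypothetical stand-in for whatever your XML parser returns (with rapidxml you would walk its own node type instead), and `Row`, `Table`, `text_of`, `collect_rows`, and `collect_tables` are names made up for this illustration. It assumes C++17 and input that has already been cleaned into well-formed XHTML.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed XML node; not part of tidy or rapidxml.
struct Node {
    std::string name;            // element name, empty for text nodes
    std::string text;            // character data, used by text nodes only
    std::vector<Node> children;  // child nodes in document order
};

using Row = std::vector<std::string>;
using Table = std::vector<Row>;

// Concatenate all character data at and below a node.
std::string text_of(const Node& n) {
    std::string out = n.text;
    for (const Node& c : n.children) out += text_of(c);
    return out;
}

// Collect the rows of one table. Descends through wrappers such as
// <thead>/<tbody>, but does not enter cells, so rows of nested tables
// are not mixed into the outer table.
void collect_rows(const Node& n, Table& table) {
    for (const Node& c : n.children) {
        if (c.name == "tr") {
            Row row;
            for (const Node& cell : c.children)
                if (cell.name == "td" || cell.name == "th")
                    row.push_back(text_of(cell));
            table.push_back(row);
        } else if (c.name != "table") {
            collect_rows(c, table);
        }
    }
}

// Find every <table> in the tree (nested ones included) and extract it.
void collect_tables(const Node& n, std::vector<Table>& tables) {
    if (n.name == "table") {
        Table t;
        collect_rows(n, t);
        tables.push_back(t);
    }
    for (const Node& c : n.children) collect_tables(c, tables);
}
```

Calling `collect_tables(root, tables)` on the document root fills `tables` with one entry per `<table>`, each a list of rows of cell text. A cell that contains a nested table will include that table's text in its own string, which may or may not be what you want.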

score 0 · Accepted Answer

You can use htmlparser ( https://github.com/HamedMasafi/htmlparser ). This library can parse, read, and modify HTML and CSS.

For example, in your case of reading a table:


    html_parser html;
    html.set_text(html_text);
    auto table = html.query("#table_id").at(0);
    for (auto tr : table->childs()) {
        for (auto td : tr->childs()) {
            // Here you have a td node and are free to read or modify its data, e.g.:
            auto td_tag = dynamic_cast<html_tag*>(td);
            if (!td_tag)
                continue; // skip nodes that are not tags (e.g. text nodes)
            td_tag->set_attr("id", "new_id"); // change an attribute
            auto id = td_tag->attr("id");
            auto text = td_tag->innser_text();
            auto cell_html = td_tag->outter_html();
        }
    }

A quick-start example is available here.
