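Question

How can I minify HTML using C++?

Resources

An external library might be the answer, but I would rather improve my current code. That said, I am very interested in other possibilities.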

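Current code

This is my C++ interpretation of the following answer

The only part I had to change from the original post is this bit at the top: "(?ix)"
...and some escape characters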

#include <string>
#include <boost/regex.hpp>

// Collapses whitespace runs and strips comments from *s in place.
void minifyhtml(std::string* s) {
  boost::regex nowhitespace(
    "(?ix)"
    "(?>"           // Match all whitespace other than a single space.
    "[^\\S ]\\s*"   // Either one [\t\r\n\f\v] and zero or more ws,
    "| \\s{2,}"     // or two or more consecutive-any-whitespace.
    ")"             // Note: The remaining regex consumes no text at all...
    "(?="           // Ensure we are not in a blacklist tag.
    "[^<]*+"        // Either zero or more non-"<" {normal*}
    "(?:"           // Begin {(special normal*)*} construct
    "<"             // or a < starting a non-blacklist tag.
    "(?!/?(?:textarea|pre|script)\\b)"
    "[^<]*+"        // more non-"<" {normal*}
    ")*+"           // Finish "unrolling-the-loop"
    "(?:"           // Begin alternation group.
    "<"             // Either a blacklist start tag.
    "(?>textarea|pre|script)\\b"
    "| \\z"         // or end of file.
    ")"             // End alternation group.
    ")"             // If we made it here, we are not in a blacklist tag.
  );
  
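  // @todo Don't remove conditional html comments
  // Non-greedy ".*?" so each comment is removed on its own; a greedy ".*"
  // would also delete all markup between the first "<!--" and the last "-->".
  boost::regex nocomments("<!--.*?-->");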
  
  *s = boost::regex_replace(*s, nowhitespace, " ");
  *s = boost::regex_replace(*s, nocomments, "");
}

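Only the first regex comes from the original post; the other is something I am still working on and should be considered far from finished. Hopefully it gives a good idea of what I am trying to accomplish.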
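One possible way to approach the @todo about conditional comments is a negative lookahead that refuses to match when the comment opener is followed by "[if". The sketch below uses std::regex instead of boost::regex purely to stay self-contained, and the helper name strip_comments is hypothetical, not part of the original post. It assumes only the downlevel-hidden form (<!--[if IE]> ... <![endif]-->) needs protecting; the downlevel-revealed form is not handled.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): removes ordinary HTML
// comments but leaves downlevel-hidden conditional comments such as
// <!--[if IE]> ... <![endif]--> untouched.
std::string strip_comments(const std::string& html) {
    // <!--      comment opener
    // (?!\[if)  not immediately followed by "[if" (conditional comment)
    // [\s\S]*?  shortest possible run of any characters, newlines included
    // -->       comment closer
    static const std::regex comment(R"(<!--(?!\[if)[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}
```

Calling strip_comments on a string that contains both kinds of comment should drop the ordinary ones and keep the conditional block. Very long comments may be slow or stack-hungry with some std::regex implementations, so this is better treated as an illustration of the pattern than as production code.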

1 Answer

Regexps are a powerful tool, but I think using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare: you can't look at it and quickly understand what it is supposed to match.

You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically, read the tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than trying to tackle it with regexps.
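To make the read/discard/write idea concrete, here is a minimal sketch using only the standard library. It is not a real HTML parser: the Token type and the tokenize, tag_name, collapse, and minify functions are all hypothetical names invented for this illustration, and the splitter deliberately ignores '>' inside attribute values, CDATA sections, and the special parsing rules for script content.

```cpp
#include <cctype>
#include <string>
#include <vector>

// One lexical unit of the input: markup, a comment, or the text in between.
struct Token {
    enum Kind { Tag, Comment, Text } kind;
    std::string value;
};

// Naive splitter: "<!-- ... -->" is a comment, "<...>" is a tag, the rest is text.
std::vector<Token> tokenize(const std::string& html) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < html.size()) {
        std::size_t end;
        if (html.compare(i, 4, "<!--") == 0) {
            end = html.find("-->", i + 4);
            end = (end == std::string::npos) ? html.size() : end + 3;
            out.push_back({Token::Comment, html.substr(i, end - i)});
        } else if (html[i] == '<') {
            end = html.find('>', i);
            end = (end == std::string::npos) ? html.size() : end + 1;
            out.push_back({Token::Tag, html.substr(i, end - i)});
        } else {
            end = html.find('<', i);
            if (end == std::string::npos) end = html.size();
            out.push_back({Token::Text, html.substr(i, end - i)});
        }
        i = end;
    }
    return out;
}

// Lower-cased element name of a tag token, e.g. "pre" for "</PRE>".
std::string tag_name(const std::string& tag) {
    std::size_t i = 1;
    if (i < tag.size() && tag[i] == '/') ++i;
    std::string name;
    while (i < tag.size() && std::isalnum(static_cast<unsigned char>(tag[i])))
        name += static_cast<char>(std::tolower(static_cast<unsigned char>(tag[i++])));
    return name;
}

// Collapses each run of whitespace to a single space.
std::string collapse(const std::string& text) {
    std::string out;
    bool in_ws = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!in_ws) out += ' ';
            in_ws = true;
        } else {
            out += static_cast<char>(c);
            in_ws = false;
        }
    }
    return out;
}

// Reads tokens, drops comments, collapses whitespace in text, and copies
// everything inside pre/textarea/script through unchanged.
std::string minify(const std::string& html) {
    std::string out;
    std::string verbatim;  // name of the currently open protected element, if any
    for (const Token& t : tokenize(html)) {
        if (t.kind == Token::Tag) {
            const std::string name = tag_name(t.value);
            const bool closing = t.value.size() > 1 && t.value[1] == '/';
            if (verbatim.empty()) {
                if (!closing && (name == "pre" || name == "textarea" || name == "script"))
                    verbatim = name;
            } else if (closing && name == verbatim) {
                verbatim.clear();
            }
            out += t.value;
        } else if (!verbatim.empty()) {
            out += t.value;  // protected content is kept as-is
        } else if (t.kind == Token::Text) {
            std::string text = collapse(t.value);
            // Avoid a double space where a dropped comment separated two text runs.
            if (!out.empty() && out.back() == ' ' && !text.empty() && text.front() == ' ')
                text.erase(0, 1);
            out += text;
        }
        // Comments outside protected elements are simply not written.
    }
    return out;
}
```

The point of the design is that each rule (which tokens to drop, which elements to protect) is an ordinary if statement rather than part of one large pattern. A real parser library would replace tokenize and tag_name while minify could stay roughly the same shape.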

I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.

In C++, the options include libxml2 (which ships an HTML parser module), Qt 4, and tinyxml; libstrophe also uses some kind of XML parser that could work.

Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, its "Beautiful Soup" module would work very well for this kind of problem.

Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).

Answered 2013-06-12T05:53:34.093