
I would like to extract only the JavaScript from script tags in an HTML document so that I can pass it to a JS parser such as esprima. I am writing this application in Node.js and already have the content of each script tag as a string. The problem is that the extracted JavaScript sometimes contains HTML comments, which I want to remove.
<!-- var a; --> should be converted to var a
Simply removing every <!-- and --> does not work, because in a case like <!-- if(j-->0); --> it also removes the --> in the middle.
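To make the failure concrete, here is a minimal sketch comparing a global removal with one that only touches a leading <!-- and a trailing -->. The function names stripNaive and stripWrapper are made up for illustration, and the anchored version assumes the comment markers wrap the whole script text.

```javascript
// Naive approach: delete every "<!--" and "-->" anywhere in the string.
function stripNaive(js) {
  return js.replace(/<!--|-->/g, "");
}

// Anchored approach (assumption: the markers wrap the entire script text):
// delete only a leading "<!--" and a trailing "-->".
function stripWrapper(js) {
  return js.replace(/^\s*<!--/, "").replace(/-->\s*$/, "");
}

var sample = "<!-- if(j-->0); -->";
console.log(JSON.stringify(stripNaive(sample)));   // " if(j0); "     (the code is corrupted)
console.log(JSON.stringify(stripWrapper(sample))); // " if(j-->0); "  (the code is intact)
```

The anchored version leaves any --> in the middle of the code alone, but it does nothing for comment markers that appear elsewhere in the script.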
I would also like to remove markers like [if !IE] and [endif], which are sometimes found inside script tags, and to extract the JS inside CDATA sections.
<![CDATA[ var a; ]]> should be converted to var a
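For the CDATA case, a rough sketch along the same lines could strip only a leading <![CDATA[ and a trailing ]]>. The name stripCdata is hypothetical, and allowing an optional // in front of each marker is an assumption based on the common habit of commenting the markers out.

```javascript
// Hypothetical helper: removes a CDATA wrapper around the whole script text.
// Assumption: the markers sit at the very start and end, optionally
// preceded by "//" as in "//<![CDATA[ ... //]]>".
function stripCdata(js) {
  return js
    .replace(/^\s*(?:\/\/\s*)?<!\[CDATA\[/, "")
    .replace(/(?:\/\/\s*)?\]\]>\s*$/, "");
}

console.log(JSON.stringify(stripCdata("<![CDATA[ var a; ]]>"))); // " var a; "
```

A script that contains ]]> somewhere other than the end is left untouched at that point, since only the final marker is anchored.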
Is all of this possible with a regex, or is something more required?
In short, I would like to sanitize the JS from script tags so that I can safely pass it to a parser like esprima.
Thanks!

EDIT:
Based on @user568109's answer, this is the rough code that handles HTML comments and CDATA sections inside script tags:

var htmlparser = require("htmlparser2");
var jstext = '';
var inScript = false; // true only while inside a JavaScript <script> element
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs) {
        if (name === "script" && attribs.type === "text/javascript") {
            inScript = true;
            jstext = '';
        }
    },
    ontext: function(text) {
        // Collect text only while inside a script element
        if (inScript) {
            jstext += text;
        }
    },
    onclosetag: function(tagname) {
        if (tagname === "script" && inScript) {
            console.log(jstext);
            inScript = false;
            jstext = '';
        }
    },
    oncomment: function(data) {
        // Keep the comment body even when it is the first thing in the script
        if (inScript) {
            jstext += data;
        }
    }
}, {
    xmlMode: true
});
parser.write(input); // input: the HTML document as a string
parser.end();

1 Answer


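This is a job for a parser. See htmlparser2, or esprima itself. Please don't use regular expressions to parse HTML, however tempting it is. You will waste precious time and effort trying to match ever more tags.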

An example from its page:

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            console.log("JS! Hooray!");
        }
    },
    ontext: function(text){
        console.log("-->", text);
    },
    onclosetag: function(tagname){
        if(tagname === "script"){
            console.log("That's it?!");
        }
    }
});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</script>");
parser.end();

Output (simplified):

--> Xyz 
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!

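It will give you all the tags: div, comments, scripts, and so on. But you have to validate the script inside comments yourself. Also, CDATA is valid markup in XML (XHTML), so htmlparser2 will detect it as a comment, and you have to check for those too.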
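As a rough sketch of that checking step, the helper below inspects the string handed to a comment callback and returns only the part worth keeping as JavaScript. The name unwrapComment is hypothetical, and the assumed input shapes ("[CDATA[ ... ]]" for CDATA reported as a comment, "[if !IE]>" and "<![endif]" for conditional-comment markers) should be verified against what your parser version actually emits.

```javascript
// Hypothetical helper: `data` is the text a parser passes to its comment callback.
function unwrapComment(data) {
  // Assumed shape of CDATA reported as a comment: "[CDATA[ ... ]]"
  var cdata = /^\[CDATA\[([\s\S]*)\]\]$/.exec(data);
  if (cdata) {
    return cdata[1]; // keep only the content of the CDATA section
  }
  // Conditional-comment markers carry no JavaScript, so drop them
  var isIfMarker = /^\s*\[if\s[^\]]*\]>?(?:<!)?\s*$/i.test(data);
  var isEndifMarker = /^\s*(?:<!)?\[endif\]\s*$/i.test(data);
  if (isIfMarker || isEndifMarker) {
    return "";
  }
  // Anything else is treated as script text hidden inside an HTML comment
  return data;
}

console.log(JSON.stringify(unwrapComment("[CDATA[ var a; ]]"))); // " var a; "
console.log(JSON.stringify(unwrapComment("[if !IE]>")));         // ""
console.log(JSON.stringify(unwrapComment(" var a; ")));          // " var a; "
```

Inside an oncomment handler, the result of unwrapComment(data) could be appended to the collected script text instead of the raw data.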

Answered 2013-07-19T12:08:16.273