I would like to extract only the JavaScript from script tags in an HTML document so that I can pass it to a JS parser like esprima. I am writing this application in Node.js and already have the content of each script tag as a string.
The problem is that the extracted JavaScript sometimes contains HTML comments, which I want to remove.
<!-- var a; -->
should be converted to var a
Simply removing every <!-- and --> does not work, because it fails on <!-- if(j-->0); --> where it also removes the --> in the middle, which is part of the JavaScript expression j-- > 0.
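To make the failure concrete, here is a minimal sketch comparing a global replace with one anchored to the start and end of the script text. The function names stripNaive and stripWrapper are made up for illustration, and the anchored version assumes the comment markers only ever wrap the whole script body, which may not hold for every page.

```javascript
// Naive approach: removes every "<!--" and "-->" anywhere in the text.
function stripNaive(js) {
    return js.replace(/<!--|-->/g, "");
}

// Anchored approach (assumption: the markers only wrap the whole script body).
// Removes one leading "<!--" and one trailing "-->", leaving the middle untouched.
function stripWrapper(js) {
    return js
        .replace(/^\s*<!--/, "")
        .replace(/-->\s*$/, "")
        .trim();
}

var sample = "<!-- if(j-->0); -->";
console.log(stripNaive(sample));   // the "-->" inside "j-->0" is lost
console.log(stripWrapper(sample)); // the "j-->0" expression survives
```

The anchored version still cannot tell a real wrapper from a script that genuinely ends in -->, so it is a heuristic rather than a full solution.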
I would also like to remove conditional-comment markers like [if !IE] and [endif], which are sometimes found inside script tags.
I would also like to extract the JS inside CDATA sections.
<![CDATA[ var a; ]]>
should be converted to var a
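For illustration, the CDATA case could be sketched with a single anchored regex. The helper name unwrapCdata is hypothetical, and it assumes the CDATA section encloses the entire script body with nothing but whitespace around it.

```javascript
// Hypothetical helper: unwraps a CDATA section that encloses the whole script body.
// Returns the input unchanged when no enclosing CDATA section is found.
function unwrapCdata(js) {
    var match = /^\s*<!\[CDATA\[([\s\S]*)\]\]>\s*$/.exec(js);
    return match ? match[1].trim() : js;
}

console.log(unwrapCdata("<![CDATA[ var a; ]]>"));
```

Scripts that hide the CDATA markers behind // or /* */ comments would need a looser pattern than this one.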
Is all of this possible with a regex, or is something more required?
In short, I would like to sanitize the JS from script tags so that I can safely pass it to a parser like esprima.
Thanks!
EDIT:
Based on @user568109's answer, this is the rough code that handles HTML comments and CDATA sections inside script tags:
var htmlparser = require("htmlparser2");
var jstext = '';
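var inScript = false; // true while the parser is inside a JavaScript <script> element
var parser = new htmlparser.Parser({
    onopentag: function(name, attribs){
        if(name === "script" && attribs.type === "text/javascript"){
            inScript = true;
            jstext = '';
        }
    },
    ontext: function(text) {
        // Only collect text that belongs to a script element.
        if(inScript) {
            jstext += text;
        }
    },
    onclosetag: function(tagname) {
        if(tagname === "script" && inScript) {
            console.log(jstext);
            inScript = false;
            jstext = '';
        }
    },
    oncomment: function(data) {
        // Keep the body of <!-- ... --> inside a script, even when it comes first.
        // (Checking jstext instead would drop a comment at the very start.)
        if(inScript) {
            jstext += data;
        }
    }
}, {
    xmlMode: true
});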
// input is the HTML document as a string
parser.write(input);
parser.end();