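We need a DOM parser that can run a set of patterns against a page and store the results. We are looking for open libraries that we can start from. The library should:
- be able to select elements by regex (for example, grab all elements that contain "price" in their class, id, meta, or other attributes),
- come with plenty of helpers, for example for removing comments, iframes, etc.,
- be very fast,
- be able to run from a browser extension.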
OK, I'll say it:
You can use jQuery.
Pros:
Cons:
Here is an example of some jQuery operations:
// select all the iframe elements with the class advertisement
// that have the word "porn" in their src attribute
$('iframe.advertisement[src*=porn]')
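// filter the ones that contain the word "poney" in their title
// (or in the title of the framed document) with the help of a regex;
// contentDocument is null for cross-origin frames, hence the guard
.filter(function(){
return /poney/i.test(this.title || (this.contentDocument && this.contentDocument.title) || '');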
})
// and remove them
.remove()
// return to the whole match
.end()
// filter them again, this time
// affect only the big ones
.filter(function(){
return $(this).width() > 100 && $(this).height() > 100;
})
// replace them with some html markup
.replaceWith('<img src="harmless_bunnies_and_kitties.jpg" />');
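jQuery's attribute selectors only do substring-style matching on one named attribute, so the "any attribute matches a regex" requirement from the question needs a filter function. Here is a minimal sketch of how that could look. The helper names `hasAttributeMatching` and `selectByAttributePattern` are made up for this example, and the code only assumes each element exposes an array-like `attributes` collection of `{name, value}` entries, as DOM elements do.

```javascript
// Hypothetical helper: true if the value of any attribute on `el` matches `pattern`.
function hasAttributeMatching(el, pattern) {
  // el.attributes is a NamedNodeMap in browsers; Array.from accepts any array-like.
  return Array.from(el.attributes || []).some(function (attr) {
    // Reset lastIndex so a regex carrying the g flag does not skip matches.
    pattern.lastIndex = 0;
    return pattern.test(attr.value);
  });
}

// Hypothetical helper: keep only the elements with at least one matching attribute.
function selectByAttributePattern(elements, pattern) {
  return Array.from(elements).filter(function (el) {
    return hasAttributeMatching(el, pattern);
  });
}
```

With jQuery the same helper can be plugged into a filter, for example `$('*').filter(function(){ return hasAttributeMatching(this, /price/i); })`. Walking every element like this is slower than a native selector, so narrow the starting set when you can.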
node-htmlparser can parse HTML, provides a DOM with a number of utils (also supports filtering by functions) and can be run in any context (even in WebWorkers).
I forked it a while back, improved it for better speed and got some insane results (read: even faster than native libexpat bindings).
Nevertheless, I would advise you to use the original version, as it supports browsers out of the box (my fork can be run in browsers using browserify, which adds some overhead).
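To give an idea of what "filtering by functions" can look like on a parsed tree, here is a small sketch that strips comments and iframes. It does not call the library itself. It assumes the DOM is a nested array of plain objects with `type`, `name` and `children` fields, which is roughly the shape I would expect from the parser's default handler, so check the actual output before relying on it. `stripNodes` and `isCommentOrIframe` are names invented for this example.

```javascript
// Hypothetical helper: return a copy of the tree without the nodes
// for which shouldDrop(node) is true. The input tree is not modified.
function stripNodes(nodes, shouldDrop) {
  return nodes
    .filter(function (node) { return !shouldDrop(node); })
    .map(function (node) {
      // Leaf nodes (text, or tags without children) are reused as-is.
      if (!Array.isArray(node.children)) return node;
      // Shallow-copy the node and clean its children recursively.
      var copy = Object.assign({}, node);
      copy.children = stripNodes(node.children, shouldDrop);
      return copy;
    });
}

// Assumed node shape: comments have type 'comment', elements have type 'tag'.
function isCommentOrIframe(node) {
  return node.type === 'comment' ||
    (node.type === 'tag' && node.name === 'iframe');
}
```

Calling `stripNodes(dom, isCommentOrIframe)` returns the cleaned tree. Since the predicate is just a function, the same walker covers other cleanup rules too.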