Does it have to be a Java solution? PhantomJS combined with pjscrape can scrape a page and collect all of its URLs.
You only need to create a JavaScript configuration file.
getlinks.js:
pjs.addSuite({
    url: 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html',
    // with noConflict, pjscrape's jQuery is exposed as _pjs.$ instead of $
    noConflict: true,
    scraper: function() {
        var links = _pjs.$('a').map(function() {
            // convert relative URLs to absolute
            // (use _pjs.$ here too, so the scraper does not rely on the page's own $)
            var link = _pjs.toFullUrl(_pjs.$(this).attr('href'));
            return link;
        });
        return links.toArray();
    }
});
pjs.config({
    // options: 'stdout' or 'file' (set in config.outFile)
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
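As a side note on the "convert relative URLs to absolute" step: the sketch below shows the general idea outside of PhantomJS, using the standard URL class that ships with Node.js and modern browsers. It is not how pjscrape implements _pjs.toFullUrl, and the toAbsolute helper is a hypothetical name used only for illustration.

```javascript
// Sketch only: resolve an href against the URL of the page it was found on.
// Uses the standard WHATWG URL class; this is NOT pjscrape's _pjs.toFullUrl.
function toAbsolute(href, pageUrl) {
    try {
        return new URL(href, pageUrl).href;
    } catch (e) {
        // href (or pageUrl) could not be parsed as a URL
        return null;
    }
}

var page = 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html';
console.log(toAbsolute('/questions/ask', page));
console.log(toAbsolute('http://meta.stackoverflow.com', page));
```

Links that are already absolute pass through with only minor normalization, so the same helper can be applied to every href on a page.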
Then run the command phantomjs pjscrape.js getlinks.js. In this example the output is stored in a file (it could also be logged to the console).
Here is the (partial) output:
* Suite 0 starting
* Opening http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Scraping http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items
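The output above contains repeated entries (the Creative Commons link appears twice, for example). If you want each URL only once, a small post-processing step can help. The following is a minimal sketch, assuming the array has already been parsed from the output file; uniqueLinksByHost is a hypothetical helper and not part of pjscrape.

```javascript
// Sketch only: drop duplicate links and group the rest by host name.
// `links` is assumed to be the array parsed from the scraper's JSON output.
function uniqueLinksByHost(links) {
    var seen = new Set();
    var byHost = new Map();
    links.forEach(function (link) {
        if (typeof link !== 'string' || seen.has(link)) { return; }
        seen.add(link);
        var host;
        try {
            host = new URL(link).hostname;
        } catch (e) {
            return; // skip anything that is not a parseable absolute URL
        }
        if (!byHost.has(host)) { byHost.set(host, []); }
        byHost.get(host).push(link);
    });
    return byHost;
}

// A few entries taken from the output shown above.
var sample = [
    'http://stackoverflow.com/about',
    'http://creativecommons.org/licenses/by-sa/3.0/',
    'http://creativecommons.org/licenses/by-sa/3.0/',
    'http://blog.stackoverflow.com/2009/06/attribution-required/'
];
var grouped = uniqueLinksByHost(sample);
grouped.forEach(function (urls, host) {
    console.log(host + ': ' + urls.length);
});
```

In Node.js the input could come from something like JSON.parse(fs.readFileSync('scrape_output.json', 'utf8')), assuming the file holds only the JSON array.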