java - Replace all URLs in an HTML page

Question

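I'm using crawler4j to crawl some HTML files, and I want to replace all links in those pages with custom links. Currently I can get the source HTML and a list of all outgoing links using this code: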

        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        List<WebURL> links = htmlParseData.getOutgoingUrls();

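However, a simple foreach loop with search and replace won't get me what I want. The problem is that WebURL.getURL() returns the absolute URL, while the links in the HTML source are sometimes relative and sometimes absolute.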

I want to handle all links (images, URLs, JavaScript files, etc.). For example, I'd like to replace images/img.gif with view.php?url=http://www.domain.com/images/img.gif.
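
The mapping described above can be sketched with java.net.URI from the standard library: resolve each link against the URL of the page it came from, then prepend the proxy prefix. This is only a minimal sketch. The class name ProxyLinks, the method toProxyLink, and the page URL http://www.domain.com/index.html are hypothetical and are not part of crawler4j; the prefix view.php?url= is taken from the example in the question.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of crawler4j.
class ProxyLinks {
    // Proxy prefix taken from the example in the question.
    static final String PREFIX = "view.php?url=";

    // Resolves a relative or absolute link against the page URL,
    // then puts the proxy prefix in front of the absolute result.
    static String toProxyLink(String pageUrl, String link) {
        URI absolute = URI.create(pageUrl).resolve(link.trim());
        return PREFIX + absolute;
    }

    public static void main(String[] args) {
        // Assumed page URL, chosen to match the domain in the question.
        String page = "http://www.domain.com/index.html";
        System.out.println(toProxyLink(page, "images/img.gif"));
        System.out.println(toProxyLink(page, "/css/site.css"));
        System.out.println(toProxyLink(page, "http://other.example/a.js"));
    }
}
```

URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, links containing unescaped spaces), so real pages would need error handling around the call. The target URL is left unencoded to match the example in the question; a real proxy script would probably expect it to be percent-encoded.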

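The only solution that comes to mind is a somewhat complicated regex, but I'm afraid I'll miss some rare cases. Has this been done already? Is there a library or tool to achieve this?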

score 0 · Accepted Answer

Does it have to be a Java solution? PhantomJS combined with pjscrape can scrape a page and find all the URLs on it.

You only need to create a configuration JavaScript file.

getlinks.js:

pjs.addSuite({
    url: 'http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html',
    noConflict: true,
    scraper: function() {
        var links = _pjs.$('a').map(function() {
            // convert relative URLs to absolute
            return _pjs.toFullUrl(_pjs.$(this).attr('href'));
        });
        return links.toArray();
    }
});
pjs.config({
    // options: 'stdout' or 'file'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});

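Then run the command phantomjs pjscrape.js getlinks.js. In this example the output is stored in a file (it could also be logged to the console instead).

Here is the (partial) output: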

* Suite 0 starting
* Opening http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Scraping http://stackoverflow.com/questions/14138297/replace-all-urls-in-a-html
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items

score 0 · Accepted Answer

I think you can use a regular expression for this.

For example:

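  ...
   // Requires java.util.regex.Matcher and java.util.regex.Pattern.
   // text holds the page HTML; prefix is the string to put in front of each link.
   String regex = "\\/[^.]*\\/[^.]*\\.";
   Pattern pattern = Pattern.compile(regex);
   Matcher matcher = pattern.matcher(text);

   // Build the result separately: calling text.replace() inside the loop
   // would prefix a link that occurs more than once several times.
   StringBuffer result = new StringBuffer();
   while (matcher.find()) {
       String imageLink = matcher.group();
       matcher.appendReplacement(result, Matcher.quoteReplacement(prefix + imageLink));
   }
   matcher.appendTail(result);
   text = result.toString();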
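
The regex idea can also be narrowed to the values of href and src attributes, which makes it possible to resolve relative links before prefixing them. The following is a minimal sketch under stated assumptions, not a complete solution: the class HtmlLinkRewriter, its rewrite method, and the sample page URL are hypothetical, and the pattern only handles quoted href/src values.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class HtmlLinkRewriter {
    // Matches href="..." or src="..." with single or double quotes.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "\\b(href|src)\\s*=\\s*([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    // Rewrites every matched attribute value to prefix + absolute URL.
    static String rewrite(String html, String pageUrl, String prefix) {
        URI base = URI.create(pageUrl);
        Matcher m = LINK_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String target;
            try {
                // Relative values are resolved against the page URL.
                target = prefix + base.resolve(m.group(3).trim());
            } catch (IllegalArgumentException e) {
                // Values that are not valid URIs are left untouched.
                target = m.group(3);
            }
            String attr = m.group(1) + "=" + m.group(2) + target + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(attr));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/faq\">FAQ</a><img src='images/img.gif'>";
        System.out.println(rewrite(html, "http://www.domain.com/index.html", "view.php?url="));
    }
}
```

This sketch still misses cases such as unquoted attribute values, srcset, and url(...) in CSS, and it does not filter out javascript: or mailto: links, which is the kind of rare case the question worries about. A real HTML parser is generally more robust than a regex for this job.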
