java - Download all images like wget, on the client side, using Java

Question

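Downloading all images from a website is very easy with wget.

But I need this functionality on the client side, preferably in Java.

I know wget's source code is available online, but I don't know any C and the source is quite complex. And of course wget has other features that, for my purposes, just bloat the source.

Java has a built-in HttpClient, but I don't know how sophisticated wget really is. Could you tell me whether it is hard to re-implement the "recursively download all images" feature in Java?

How exactly is this done? Does wget fetch the HTML source of the given URL, extract all URLs with the given file endings (.jpg, .png) from the HTML, and download them? Does it also search for images in the stylesheets linked from that HTML document?

How would you do this? Would you use regular expressions to search the HTML document for (relative and absolute) image URLs and let HttpClient download them? Or is there already a Java library that does something similar?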

score 2 · Accepted Answer

In Java you can use the Jsoup library to parse any web page and extract whatever you want from it.

answered 2013-09-10T11:04:48.927

score 0 · Accepted Answer

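For me, crawler4j is the open-source library to use for recursively crawling (and copying) a site, for example like this (their QuickStart example; it also supports crawling URLs found in CSS):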

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs with these extensions are skipped. Note that the QuickStart filter
    // also excludes images, so it would need adjusting to collect them.
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        // Only HTML pages carry text, markup and outgoing links.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

More web crawlers and HTML parsers can be found here.

score -1 · Accepted Answer

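I found this program that downloads images. It is open source.

You can find the images on a website by looking at its <IMG> tags. Take a look at the following question; it may help you: Get all Images from WebPage Program | Java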
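
As a rough illustration of the <IMG>-tag idea, here is a minimal sketch that uses only the Java standard library. It is not what wget does internally, and it is not taken from the linked question. The class name ImgTagScraper, the method names, the example URLs and the .jpg/.png filter are all assumptions made for this example. A regular expression is fragile against real-world HTML, so a proper parser such as Jsoup is likely the more robust choice. The sketch also does not follow links or look inside stylesheets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
final class ImgTagScraper {

    // Matches a quoted src attribute inside an <img ...> tag.
    // Unquoted attribute values are deliberately not handled.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the absolute URLs of all .jpg/.png images referenced by <img> tags.
    static List<URI> extractImageUrls(String html, URI pageUrl) {
        List<URI> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(2).trim();
            String lower = src.toLowerCase(Locale.ROOT);
            if (!(lower.endsWith(".jpg") || lower.endsWith(".png"))) {
                continue; // assumed filter: only the endings named in the question
            }
            try {
                URI resolved = pageUrl.resolve(src); // relative -> absolute
                if (!urls.contains(resolved)) {
                    urls.add(resolved);
                }
            } catch (IllegalArgumentException e) {
                // src was not a syntactically valid URI; skip it
            }
        }
        return urls;
    }

    // Downloads one image into targetDir, naming the file after the last path segment.
    static Path download(URI imageUrl, Path targetDir) throws IOException {
        String path = imageUrl.getPath();
        if (path == null) {
            throw new IOException("No path in URL: " + imageUrl);
        }
        Path target = targetDir.resolve(path.substring(path.lastIndexOf('/') + 1));
        try (InputStream in = imageUrl.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) {
        // Inline HTML keeps the demo offline; a real run would fetch the page first.
        String html = "<img src=\"/a/b.png\"> <img alt='x' src=\"thumbs/c.jpg\">";
        URI page = URI.create("http://example.com/gallery/index.html");
        for (URI url : extractImageUrls(html, page)) {
            System.out.println(url);
        }
    }
}
```

To actually save the files, each URI returned by extractImageUrls could be passed to download together with an existing target directory. Recursion in the style of wget would mean applying the same steps to every linked page as well, which is the part a crawler library takes care of.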
