java - Need help getting the HTML of a website in Java

Question

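I took some code from the question "java httpurlconnection cutting off html", and my code for fetching HTML from websites in Java is almost the same. It works, except for one particular website:

I am trying to get the HTML from this website:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting junk characters, although the same code works fine with any other website, such as http://www.google.com.

This is the code I am using: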

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n"); 
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work with the URL mentioned above.

Any help would be appreciated.

score 7 · Accepted Answer

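The site incorrectly gzips the response regardless of the client's capabilities. Normally, a server should gzip the response only when the client has signalled support for it (via Accept-Encoding: gzip). The junk characters are the compressed bytes being read as text, so you need to decompress them with GZIPInputStream.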

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));
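The line above always wraps the stream in GZIPInputStream, which fails on a response that is not compressed. As a minimal sketch of one way to handle both cases, the helper below peeks at the first two bytes and decompresses only when it sees the gzip magic number (0x1f 0x8b). The names GzipSniffer, maybeGunzip and readAll are hypothetical and not part of the original answer, and reading as UTF-8 is an assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not from the original answer.
class GzipSniffer {

    // Returns a decompressing stream if the data starts with the gzip
    // magic bytes 0x1f 0x8b, otherwise returns the data unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);           // remember the start so we can rewind
        int b1 = in.read();
        int b2 = in.read();
        in.reset();           // rewind to the first byte
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    // Reads the whole stream into a String, assuming UTF-8 content.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the question's code, this would be used as GzipSniffer.maybeGunzip(connection.getInputStream()) in place of the raw stream. Checking connection.getContentEncoding() for "gzip" is an alternative when the server reports its encoding reliably.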

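Note that I also passed the correct charset ("UTF-8") to the InputStreamReader constructor. Normally, you would extract it from the Content-Type header of the response.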
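Extracting the charset from the Content-Type header could look roughly like the sketch below. It assumes the header has the usual form "text/html; charset=UTF-8" and falls back to a caller-supplied default when the parameter is missing or names an unknown charset. ContentTypeCharset and charsetFrom are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not from the original answer.
class ContentTypeCharset {

    // Parses the charset parameter of a Content-Type header value,
    // e.g. "text/html; charset=UTF-8". Returns fallback if the header
    // is null, has no charset parameter, or names an unknown charset.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a leading "charset="
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

In the question's code, the header value is available from connection.getContentType(), and the resulting Charset can be passed as the second argument of the InputStreamReader constructor.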

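For more hints, see also "How to use URLConnection to fire and handle HTTP requests?". If all you want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser such as Jsoup instead.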
