java - What is the best way to read the contents of a web page into a Java String?

Question

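I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.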

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

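I can't help feeling that reading line by line is less than optimal. I know that I may be masking a MalformedURLException caused by the openConnection call, and I'm fine with that.

My function also has the side effect of giving the HTML string the correct line terminators for the current system. This isn't a requirement.

I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.

On a side note: it would be nice if StringBuilder had a constructor that took an open InputStream and simply read all of its contents into the StringBuilder.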

score 10 · Accepted Answer

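As seen in the other answers, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that any robust solution should account for. I therefore suggest that, in anything other than a toy program, you use the de facto standard Java HTTP library: Apache HTTP Components HTTP Client.

They provide many samples; "just" getting the response contents for a request looks like this: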

HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://www.google.com/"); 
ResponseHandler<String> responseHandler = new BasicResponseHandler();    
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
System.out.println(responseBody);
httpclient.getConnectionManager().shutdown();

score 2 · Accepted Answer

OK, edited once more. Be sure to put your try-finally blocks around it, or catch IOException.

 ...
 final static int BUFZ = 4096;
 StringBuilder page = new StringBuilder();
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[BUFZ];
 int nRead = 0;

 while ((nRead = is.read(buf, 0, BUFZ)) > 0) {
    // convert only the nRead bytes filled by this read
    page.append(new String(buf, 0, nRead /* , Charset charset */));
    // uses local default char encoding for now; a multi-byte
    // character split across two reads would be decoded incorrectly
 }

Here, try this:

 ...
 final static int MAX_SIZE = 10000000;
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
 InputStream is = conn.getInputStream();
 // perhaps allocate this one time and reuse if you
 // call this method a lot.
 byte[] buf = new byte[MAX_SIZE];
 int nRead = 0;
 int total = 0;
 // you could also use ArrayList so that you could dynamically
 // resize or there are other ways to resize an array also
 while (total < MAX_SIZE
        && (nRead = is.read(buf, total, MAX_SIZE - total)) > 0) {
      // each read continues at offset total, so earlier bytes are kept
      total += nRead;
 }
 ...
 // do something with buf array of length total
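
The comment in the snippet above mentions resizing dynamically instead of allocating MAX_SIZE bytes up front. A minimal sketch of one way to do that with java.io.ByteArrayOutputStream follows. The class name StreamBytes and the method names readFully and readText are hypothetical, chosen only for this illustration, and the caller is assumed to know the page's charset. Decoding happens once at the end, so a multi-byte character split across two reads is not a problem.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original answer.
final class StreamBytes {
    private static final int BUFZ = 4096;

    /** Reads the stream to its end; the output buffer grows as needed. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[BUFZ];
        int nRead;
        // read() returns -1 at end of stream
        while ((nRead = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, nRead);
        }
        return out.toByteArray();
    }

    /** Decodes all bytes in one step using the charset supplied by the caller. */
    static String readText(InputStream in, Charset charset) throws IOException {
        return new String(readFully(in), charset);
    }
}
```

With the connection from the snippet above, a call might look like StreamBytes.readText(conn.getInputStream(), StandardCharsets.UTF_8), assuming the page is served as UTF-8.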

OK, the code below will not work for you, because with HTTP/1.1 "chunking" the Content-Length header line is not sent up front.

 ...
 HttpURLConnection conn = 
    (HttpURLConnection) new URL(url).openConnection();
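 InputStream is = conn.getInputStream();
 // getContentLength() returns -1 when the length is not known,
 // as with a chunked response, so the allocation below fails
 int cLen = conn.getContentLength();
 byte[] buf = new byte[cLen];
 int nRead = 0;

 while (nRead < cLen) {
      nRead += is.read(buf, nRead, cLen - nRead);
 }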
 ...
 // do something with buf array

score 1 · Accepted Answer

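You could do your own buffering on top of the InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.

But that would make your code harder to understand, and I doubt it would be worth it.

Note that Sean A.O. Harney's proposal reads raw bytes, so you would need to convert them to text on top of that.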
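
As a rough sketch of the character-array approach described above: the class name PageText and the method name readAll are hypothetical, and the 8192-char buffer size is an arbitrary assumption rather than a tuned value. The reader is given an explicit charset so that the byte-to-text conversion does not depend on the platform default.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original answer.
final class PageText {
    /** Reads the whole stream as text in chunks, then closes it. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        char[] buf = new char[8192]; // assumed chunk size
        try (Reader reader = new InputStreamReader(in, charset)) {
            int nRead;
            // read() returns -1 once the end of the stream is reached
            while ((nRead = reader.read(buf, 0, buf.length)) != -1) {
                // append only the chars filled by this read
                page.append(buf, 0, nRead);
            }
        }
        return page.toString();
    }
}
```

Unlike the readLine loop in the question, this keeps the page's original line terminators instead of replacing them with the system's.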
