I need to process large gzip-compressed text files.
InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation();
}
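This code works as long as I do no lengthy computation inside the loop (which I have to do). But adding a sleep of just a few milliseconds per line causes the program to eventually crash with a java.util.zip.ZipException. The exception message differs each time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So it seems the stream gets corrupted when I do not read it fast enough.
I can decompress the file without any problems. I also tried GzipCompressorInputStream from Apache Commons Compress, with the same result.
What is the problem here, and how can it be solved?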
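Update 1
I thought I had ruled this out, but after more testing I found that the problem only occurs with files streamed from the Internet.
Full example: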
URL source = new URL(url);
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Accept", "gzip, deflate");
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
String line;
while ((line = br.readLine()) != null) { //exception is thrown here
    Thread.sleep(5);
}
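Interestingly, when I print the line numbers, I find that the program always crashes at one of the same four or five lines.
Update 2
Here is a complete example with an actual file: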
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;
public class TestGZIPStreaming {
    // Thread.sleep can throw InterruptedException, so main has to declare it as well.
    public static void main(String[] args) throws IOException, InterruptedException {
        URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
        HttpURLConnection connection = (HttpURLConnection) source.openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "gzip, deflate");
        BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));
        String line;
        int n = 0;
        while ((line = br.readLine()) != null) { // exception is thrown here
            Thread.sleep(10);
            System.out.println(++n);
        }
    }
}
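For this file, the crash occurs around line 90000. To rule out a timeout, I tried connection.setReadTimeout(0), which had no effect.
It is probably a network problem. But since I can download the file in a browser, there has to be a way to handle it.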
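One way to act on the observation that a plain download succeeds is to separate downloading from processing: copy the compressed bytes to a temporary file at full speed, then decompress the local copy at whatever pace the per-line work allows. The following is a minimal sketch under that assumption, not a confirmed fix. The class name SpoolThenRead and its methods spool and countLines are hypothetical; the caller would pass in connection.getInputStream().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, for illustration only.
final class SpoolThenRead {

    /** Copies the raw (still compressed) bytes to a temp file as fast as the source delivers them. */
    static Path spool(InputStream raw) throws IOException {
        Path tmp = Files.createTempFile("download-", ".gz");
        try (InputStream in = raw) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    /** Decompresses the local copy line by line; slow per-line work no longer keeps a socket waiting. */
    static long countLines(Path gz) throws IOException {
        long n = 0;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gz)), StandardCharsets.UTF_8))) {
            while (br.readLine() != null) {
                n++; // someComputation() would go here
            }
        }
        return n;
    }
}
```

The trade-off is disk space and a delay before the first line is processed, in exchange for a network read that is never throttled by the consumer.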
Update 3
I tried connecting with Apache HttpClient.
HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));
Now I get the following exception, which may be more helpful.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
Again, there has to be a way to handle this, since I can download the file in a browser and decompress it without any problems.
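The HttpClient trace shows the connection closing after 1581056 of 3850131 expected bytes. If the server honors HTTP range requests (an assumption, not verified for this host), one possible workaround is to reopen the connection at the current byte offset and keep feeding the same GZIPInputStream. A rough sketch follows; RangeOpener and ResumingInputStream are hypothetical names, and the opener is left abstract so it can be backed by any HTTP client.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical callback: reopens the source starting at the given byte offset. */
interface RangeOpener {
    InputStream openAt(long offset) throws IOException;
}

/** Sketch of an InputStream that reconnects when the underlying stream ends or fails mid-body. */
final class ResumingInputStream extends InputStream {
    private final RangeOpener opener;
    private final long totalLength; // expected size, e.g. taken from Content-Length
    private final int maxRetries;   // total number of reconnects allowed
    private InputStream current;
    private long position = 0;
    private int retries = 0;

    ResumingInputStream(RangeOpener opener, long totalLength, int maxRetries) throws IOException {
        this.opener = opener;
        this.totalLength = totalLength;
        this.maxRetries = maxRetries;
        this.current = opener.openAt(0);
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n < 0 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (position < totalLength) {
            int n;
            try {
                n = current.read(buf, off, len);
            } catch (IOException e) {
                n = -1; // treat a dropped connection like a premature end of stream
            }
            if (n > 0) {
                position += n;
                return n;
            }
            // Fewer than totalLength bytes were delivered: reconnect at the current offset.
            if (retries++ >= maxRetries) {
                throw new IOException("gave up at byte " + position + " of " + totalLength);
            }
            try {
                current.close();
            } catch (IOException ignored) {
                // the old connection is already broken
            }
            current = opener.openAt(position);
        }
        return -1; // all expected bytes have been delivered
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```

With HttpURLConnection, an opener might send the request header Range: bytes=<offset>- via setRequestProperty and check for a 206 Partial Content response before returning getInputStream(). A server that ignores the header and answers 200 would restart from byte 0, so that case has to be rejected rather than passed through.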