
I need to process large gzip-compressed text files.

InputStream is = new GZIPInputStream(new FileInputStream(path));
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
while ((line = br.readLine()) != null) {
    someComputation();  // stand-in for the real per-line work
}

This code works if I don't do any long-running computation inside the loop (which I have to). But adding a sleep of just a few milliseconds per line makes the program eventually crash with a java.util.zip.ZipException. The exception message is different every time ("invalid literal/length code", "invalid block type", "invalid stored block lengths").
So the stream seems to get corrupted when I don't read from it fast enough.

I can decompress the file locally without any problem. I also tried GzipCompressorInputStream from Apache Commons Compress, with exactly the same result.
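
For reference, the Commons Compress variant looked roughly like this; this is a reconstruction from memory rather than the exact code I ran (GzipCompressorInputStream comes from org.apache.commons.compress.compressors.gzip):

// Same read loop as above; only the decompressing stream differs.
BufferedReader br = new BufferedReader(new InputStreamReader(
        new GzipCompressorInputStream(new FileInputStream(path))));
String line;
while ((line = br.readLine()) != null) {
    someComputation();  // same stand-in work as above
}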
What is going wrong here, and how can it be fixed?

Update 1

I thought I had already ruled this out, but after more testing I found that the problem only occurs when streaming files from the internet.

Full example:

URL source = new URL(url);      
HttpURLConnection connection = (HttpURLConnection) source.openConnection();
connection.setRequestMethod("GET"); 
connection.setRequestProperty("Accept", "gzip, deflate"); 
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));        
String line;
while ((line = br.readLine()) != null) { //exception is thrown here
    Thread.sleep(5);  
}

Interestingly, when I print line numbers, I see that the program always crashes at one of the same four or five distinct lines.


Update 2

Here is a complete example using an actual file:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;


public class TestGZIPStreaming {

    public static void main(String[] args) throws IOException, InterruptedException {

        URL source = new URL("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");      
        HttpURLConnection connection = (HttpURLConnection) source.openConnection();
        connection.setRequestMethod("GET"); 
        connection.setRequestProperty("Accept", "gzip, deflate"); 
        BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream())));       

        String line;
        int n = 0;

        while ((line = br.readLine()) != null) { //exception is thrown here
            Thread.sleep(10);  
            System.out.println(++n);
        }

    }

}

For this file the crash shows up around line 90,000. To rule out a timeout problem I tried connection.setReadTimeout(0), which had no effect.

It must be some kind of network problem. But since I can download the file in a browser, there has to be a way to handle it.
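
One workaround I am considering but have not tested yet: drain the connection to a temporary file at full speed, so the socket never sits idle, and only decompress and process afterwards. A minimal sketch (the whole approach is my assumption, not a confirmed fix; it additionally uses java.nio.file.Files, Path and StandardCopyOption):

// Untested sketch: download at full speed first, process afterwards.
Path tmp = Files.createTempFile("wikidata-statements", ".nt.gz");
try (InputStream in = connection.getInputStream()) {
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);  // drains the socket as fast as possible
}
try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(tmp))))) {
    String line;
    while ((line = br.readLine()) != null) {
        Thread.sleep(10);  // slow per-line work no longer stalls the network stream
    }
}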

Update 3

I tried making the connection with Apache HttpClient instead.

HttpClient client = HttpClients.createDefault();
HttpGet get = new HttpGet("http://tools.wmflabs.org/wikidata-exports/rdf/exports/20151130/wikidata-statements.nt.gz");
get.addHeader("Accept-Encoding", "gzip");
HttpResponse response = client.execute(get);
BufferedReader br = new BufferedReader(new InputStreamReader(new GZIPInputStream(new BufferedInputStream(response.getEntity().getContent()))));

Now I get the following exception, which may be more helpful:

org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 3850131; received: 1581056)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:238)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)

Again, there must be a way to handle this, because I can download the file in a browser and decompress it without any problem.
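
Another direction I may try, assuming the server honors HTTP Range requests (I have not verified that it does): when the body ends prematurely, reopen the connection with a Range header at the byte offset received so far, append to a local file, and decompress only once the download is complete. A rough sketch, reusing source from the example above (the output file name is arbitrary; it uses java.io.OutputStream plus java.nio.file.Files, Path, Paths and StandardOpenOption):

// Untested sketch; assumes the server answers "Range: bytes=<offset>-" with 206 Partial Content.
// A real version should check the response code before appending, since a 200 would restart the body.
Path out = Paths.get("wikidata-statements.nt.gz");
long offset = 0;
while (true) {
    HttpURLConnection c = (HttpURLConnection) source.openConnection();
    if (offset > 0) {
        c.setRequestProperty("Range", "bytes=" + offset + "-");
    }
    try (InputStream in = c.getInputStream();
         OutputStream fileOut = Files.newOutputStream(out,
                 StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            fileOut.write(buf, 0, n);
            offset += n;
        }
        break;  // stream ended normally, download complete
    } catch (IOException e) {
        // premature end: loop around and resume from the current offset
    }
}
// then decompress the complete local file, e.g. as in the temp-file sketch above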
