python-2.7 - Reading a gzip file from a URL with zlib in Python 2.7

Question

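I am trying to read a gzip file from a URL in Python 2.7 without saving a temporary file. For some reason, though, the text file I get is truncated. I have spent a long time searching the web for a solution, without success. If I save the "raw" data back to a gzip file, it is not truncated (see the sample code below). What am I doing wrong?

My sample code: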

    import urllib2
    import zlib
    from StringIO import StringIO

    url = "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_00-latest.vcf.gz"

    # Create an opener
    opener = urllib2.build_opener()

    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')

    # Fetch the gzip file
    respond = opener.open(request)
    compressedData = respond.read()
    respond.close()

    opener.close()

    # Extract data and save to text file
    compressedDataBuf = StringIO(compressedData)
    d = zlib.decompressobj(16+zlib.MAX_WBITS)

    buffer = compressedDataBuf.read(1024)
    saveFile = open('/tmp/test.txt', "wb")
    while buffer:
        saveFile.write(d.decompress(buffer))
        buffer = compressedDataBuf.read(1024)
    saveFile.close()

    # Save "raw" data to new gzip file.
    saveFile = open('/tmp/test.gz', "wb")
    saveFile.write(compressedData)
    saveFile.close()

score 0 · Accepted Answer

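Because that gzip file consists of many concatenated gzip streams, which RFC 1952 permits. gzip automatically decompresses all of the gzip streams, whereas a single zlib decompression object stops at the end of the first one.

You need to detect the end of each gzip stream and restart decompression with the compressed data that follows it. Look up unused_data in the Python documentation.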
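As a minimal sketch of that approach, assuming the whole download is already in memory (as compressedData is in the question), the loop below creates a fresh decompression object for each gzip stream and feeds it whatever the previous object left in unused_data. The function name decompress_concatenated_gzip is hypothetical and not part of zlib.

```python
import zlib


def decompress_concatenated_gzip(data):
    """Decompress every gzip stream found back to back in `data`.

    Hypothetical helper for illustration; `data` is the raw compressed bytes.
    """
    chunks = []
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(d.decompress(data))
        chunks.append(d.flush())
        # Bytes past the end of this gzip stream belong to the next stream.
        data = d.unused_data
    return b"".join(chunks)
```

With the question's code, this would be called as decompress_concatenated_gzip(compressedData) and the result written to the output file. For files too large to hold in memory, the same idea should carry over to chunked reads, with each chunk's unused_data handed to a new decompression object.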
