python - Opening a zipfile with an unsupported compression type silently returns an empty file stream instead of throwing an exception

Question

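I seem to be banging my head against a newbie error, and I am not a newbie. I have a 1.2G known-good zipfile 'train.zip' containing a 3.5G file 'train.csv'. I open the zipfile and the file itself without any exceptions (no LargeZipFile), but the resulting file stream appears to be empty. (UNIX 'unzip -c ...' confirms the archive is good.) The file objects returned by Python's ZipFile.open() are not seekable or tellable, so I can't check that way.

The Python distribution is 2.7.3 EPD-free 7.3-1 (32-bit), but that should be fine for large zips. The OS is MacOS 10.6.6.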

# Python 2.7 syntax, matching the environment described above
import csv
import os
import zipfile as zf

file_name = 'train.csv'
zip_pathname = os.path.join('/my/data/path/.../', 'train.zip')
#with zf.ZipFile(zip_pathname).open(file_name) as z:
z = zf.ZipFile(zip_pathname, 'r', zf.ZIP_DEFLATED, allowZip64=True) # I tried all permutations
z.debug = 1
z.testzip() # zipfile integrity is ok

z1 = z.open(file_name, 'r') # our file keeps coming up empty?

# Check the info to confirm z1 is indeed a valid 3.5Gb file...
z1i = z.getinfo(file_name)
for att in ('filename', 'file_size', 'compress_size', 'compress_type', 'date_time',  'CRC', 'comment'):
    print '%s:\t' % att, getattr(z1i, att)
# ... and it looks ok. compress_type = 9 ok?
#filename:  train.csv
#file_size: 3729150126
#compress_size: 1284613649
#compress_type: 9
#date_time: (2012, 8, 20, 15, 30, 4)
#CRC:   1679210291

# All attempts to read z1 come up empty?!
# z1.readline() gives ''
# z1.readlines() gives []
# z1.read() takes ~60sec but also returns '' ?

# code I would want to run is:
reader = csv.reader(z1)
header = reader.next()
# ... and then return reader from the enclosing function

score 22 · Accepted Answer

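The cause is a combination of two things:

This file's compression type is type 9: Deflate64/Enhanced Deflate (PKWare's proprietary format, as opposed to the more common type 8).
And a zipfile bug: it does not throw an exception for unsupported compression types. It used to just silently return a bad file object [Section 4.4.5 compression method]. Argh. How very bogus. UPDATE: I filed bug 14313 and it was fixed back in 2012, so it now raises NotImplementedError when the compression type is unknown.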
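To catch this situation before reading any data, one option is to inspect each member's compress_type up front. The following is a minimal sketch, not part of the original answer: the names SUPPORTED_METHODS and unsupported_members are hypothetical, and the set assumes an interpreter built with zlib, bz2 and lzma.

```python
import zipfile

# Methods the standard zipfile module knows how to decode. bz2 and lzma
# also depend on how the interpreter was built (assumption: both present).
SUPPORTED_METHODS = {
    zipfile.ZIP_STORED,    # 0
    zipfile.ZIP_DEFLATED,  # 8
    zipfile.ZIP_BZIP2,     # 12
    zipfile.ZIP_LZMA,      # 14
}

def unsupported_members(zip_path):
    """Return (filename, compress_type) pairs that zipfile cannot decompress."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.compress_type)
            for info in archive.infolist()
            if info.compress_type not in SUPPORTED_METHODS
        ]
```

For the archive in the question, which reports compress_type 9 for train.csv, a check like this would be expected to flag that member rather than let the read come back empty.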

The command-line workaround is to unzip and then rezip the archive, which gives you a plain type 8: Deflated.
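The rezip half of that workaround can also be done from Python once the archive has been extracted with a tool that understands Deflate64. This is only a sketch under that assumption: rezip_directory is a hypothetical helper, and dest_zip should live outside src_dir so the new archive does not try to include itself.

```python
import os
import zipfile

def rezip_directory(src_dir, dest_zip):
    """Write every file under src_dir into dest_zip using plain Deflate (type 8)."""
    with zipfile.ZipFile(dest_zip, 'w', zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to src_dir so the archive layout stays clean.
                archive.write(full_path, os.path.relpath(full_path, src_dir))
```

The resulting archive uses type 8 for every member, so zipfile can read it back.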

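zipfile will throw an exception in 2.7 and 3.2+. I guess zipfile will never be able to actually handle type 9, for legal reasons. The Python docs make no mention that zipfile cannot handle other compression types :(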

score 4 · Accepted Answer

My solution for handling compression types that Python's ZipFile does not support is to fall back on a call to 7zip when ZipFile.extractall fails.

from zipfile import ZipFile
import subprocess

def Unzip(zipFile, destinationDirectory):
    try:
        with ZipFile(zipFile, 'r') as zipObj:
            # Extract all the contents of the zip file into the destination directory
            zipObj.extractall(destinationDirectory)
    except Exception:
        print("An exception occurred extracting with Python ZipFile library.")
        print("Attempting to extract using 7zip")
        # Wait for 7z to finish; raises CalledProcessError if it exits non-zero.
        # Note: "e" flattens directories, while "x" would keep the archive's paths.
        subprocess.check_call(["7z", "e", str(zipFile), f"-o{destinationDirectory}", "-y"])

score 2 · Accepted Answer

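Compression type 9 is Deflate64/Enhanced Deflate, which Python's zipfile module doesn't support (essentially because zlib doesn't support Deflate64, and zipfile delegates to zlib).

If smaller files work fine, I suspect this zipfile was created by Windows Explorer: for larger files, Windows Explorer can decide to use Deflate64.

(Note that Zip64 is different from Deflate64. Zip64 is supported by Python's zipfile module; it only changes how some metadata is stored in the zipfile, and the data itself is still compressed with regular Deflate.)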
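To see the two properties side by side for a given archive, something like the following could be used. It is a rough sketch: describe_members and SIZE_FIELD_MAX are hypothetical names, and the Zip64 check looks only at member sizes, not at the other conditions (header offsets, entry counts) that can also require Zip64.

```python
import zipfile

# 32-bit size fields cannot hold this value or anything larger;
# 0xFFFFFFFF itself is reserved as the "look in the Zip64 extra field" marker.
SIZE_FIELD_MAX = 0xFFFFFFFF

def describe_members(zip_path):
    """Return (filename, compress_type, needs_zip64_sizes) for each member."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            needs_zip64 = max(info.file_size, info.compress_size) >= SIZE_FIELD_MAX
            rows.append((info.filename, info.compress_type, needs_zip64))
    return rows
```

The second column is the compression method (8 for Deflate, 9 for Deflate64) and the third is the metadata question, so the two can vary independently.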

However, stream-unzip now supports Deflate64. Here is its example modified to read from the local disk and to read a CSV file as in your example:

import csv
from io import IOBase, TextIOWrapper
import os

from stream_unzip import stream_unzip

def get_zipped_chunks(zip_pathname):
    # Read the archive from disk in 64 KiB chunks
    with open(zip_pathname, 'rb') as f:
        while True:
            chunk = f.read(65536)
            if not chunk:
                break
            yield chunk

def get_unzipped_chunks(zipped_chunks, filename):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        if file_name != filename:
            # Members we skip must still be iterated to completion
            for chunk in unzipped_chunks:
                pass
            continue
        yield from unzipped_chunks

def to_str_lines(iterable):
    # Based on the answer at https://stackoverflow.com/a/70639580/1319998
    chunk = b''
    offset = 0
    it = iter(iterable)

    def up_to_iter(size):
        nonlocal chunk, offset

        while size:
            if offset == len(chunk):
                try:
                    chunk = next(it)
                except StopIteration:
                    break
                else:
                    offset = 0
            to_yield = min(size, len(chunk) - offset)
            offset = offset + to_yield
            size -= to_yield
            yield chunk[offset - to_yield:offset]

    class FileLikeObj(IOBase):
        def readable(self):
            return True
        def read(self, size=-1):
            return b''.join(up_to_iter(float('inf') if size is None or size < 0 else size))

    yield from TextIOWrapper(FileLikeObj(), encoding='utf-8', newline='')

zipped_chunks = get_zipped_chunks(os.path.join('/my/data/path/.../', 'train.zip'))
unzipped_chunks = get_unzipped_chunks(zipped_chunks, b'train.csv')
str_lines = to_str_lines(unzipped_chunks)
csv_reader = csv.reader(str_lines)

for row in csv_reader:
    print(row)
