python - Python gzip refuses to read uncompressed file

Question

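I seem to remember that the Python gzip module used to let you read non-gzipped files transparently. That was really useful, because it meant you could read an input file whether or not it was gzipped. You just didn't have to worry about it.

Now I get an IOError exception (in Python 2.7.5):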

Traceback (most recent call last):
  File "tst.py", line 14, in <module>
    rec = fd.readline()
  File "/sw/lib/python2.7/gzip.py", line 455, in readline
    c = self.read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/sw/lib/python2.7/gzip.py", line 296, in _read
    self._read_gzip_header()
  File "/sw/lib/python2.7/gzip.py", line 190, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

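If anyone has a neat trick, I'd like to hear about it. Yes, I know how to catch the exception, but I find it rather clunky to first read a line, then close the file and open it again.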

score 14 · Accepted Answer

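The best solution for this would be to use https://github.com/ahupp/python-magic with libmagic. You simply cannot avoid at least reading a header to identify a file (unless you implicitly trust file extensions).

If you're feeling spartan, the magic number for identifying gzip(1) files is the first two bytes being 0x1f 0x8b.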

In [1]: f = open('foo.html.gz', 'rb')
In [2]: print `f.read(2)`
'\x1f\x8b'

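gzip.open is just a wrapper around GzipFile, so you could have a function like this that returns the correct type of object depending on what the source is, without having to open the file twice: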

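#!/usr/bin/python

import gzip

def opener(filename):
    # Open in binary mode and sniff the two-byte gzip magic number.
    f = open(filename, 'rb')
    is_gzip = f.read(2) == b'\x1f\x8b'  # bytes literal: also correct on Python 3
    f.seek(0)  # rewind so the caller sees the file from the start
    if is_gzip:
        return gzip.GzipFile(fileobj=f)
    return f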
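
On Python 3 the function above hands back bytes either way. If text lines are wanted, a minimal sketch of a text-mode variant might look like this. The name open_text and the UTF-8 default are assumptions made for illustration, not part of the answer:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def open_text(filename, encoding="utf-8"):
    """Open a plain or gzip-compressed file for text reading (Python 3)."""
    raw = open(filename, "rb")
    try:
        is_gzip = raw.read(2) == GZIP_MAGIC
        raw.seek(0)  # rewind: the sniffed bytes are part of the stream
        binary = gzip.GzipFile(fileobj=raw) if is_gzip else raw
        # Decode whichever binary stream was chosen into str lines.
        return io.TextIOWrapper(binary, encoding=encoding)
    except Exception:
        raw.close()
        raise
```

It can be used like the built-in open, for example `with open_text(path) as f:` followed by a loop over `f`. One caveat: closing a GzipFile does not close a fileobj that was passed in, so in the gzip case the underlying file is only closed once it is garbage-collected.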

score 9 · Accepted Answer

Maybe you're thinking of zless or zgrep, which will open compressed or uncompressed files without complaining.

Can you trust that the file name ends in .gz?

import gzip

if file_name.endswith('.gz'):
    opener = gzip.open
else:
    opener = open

# 'rb' so that both openers return bytes, on Python 3 as well
with opener(file_name, 'rb') as f:
    ...

score 2 · Accepted Answer

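Read the first four bytes. If the first three are 0x1f, 0x8b, 0x08, and the high three bits of the fourth byte are zero, then start gzip decompression beginning with those four bytes. Otherwise write out the four bytes and continue to read transparently.

You should still keep the clunky solution as a backup, so that if the gzip read fails anyway, you back up and read transparently. But it is quite unlikely that the first four bytes would mimic a gzip file so well without the file being one.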
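
The four-byte check and its fallback could be sketched as follows on Python 3. The function names are hypothetical, and the list of caught exceptions is an assumption about what a failed gzip read can raise rather than a documented guarantee:

```python
import gzip
import zlib


def looks_like_gzip(header):
    """Check four bytes: magic 1f 8b, method 08, high three flag bits clear."""
    return (
        len(header) >= 4
        and header[:3] == b"\x1f\x8b\x08"
        and (header[3] & 0xE0) == 0  # indexing bytes yields an int on Python 3
    )


def read_maybe_gzip(filename):
    """Return the file's contents, decompressed if gzip, raw otherwise."""
    with open(filename, "rb") as f:
        header = f.read(4)
        f.seek(0)
        if looks_like_gzip(header):
            try:
                return gzip.GzipFile(fileobj=f).read()
            except (OSError, EOFError, zlib.error):
                # Header matched but the body is not valid gzip:
                # back up and read the file as-is.
                f.seek(0)
        return f.read()
```

This reads the whole file into memory, which keeps the fallback simple. A streaming version would need to buffer whatever was already handed out before the failure.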

score 1 · Accepted Answer


You can use fileinput.input(files, openhook=fileinput.hook_compressed) to iterate over the files transparently.
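
A short sketch of that call on Python 3 follows. Note that hook_compressed chooses the decompressor from the file extension (.gz or .bz2), not from the file's contents, so it will not help with a gzip file that lacks the extension. The helper name iter_lines is hypothetical, and mode="rb" is an assumption made here so that plain and compressed files both yield bytes:

```python
import fileinput


def iter_lines(paths):
    """Yield lines (as bytes) from plain, .gz and .bz2 files alike."""
    with fileinput.input(files=paths, mode="rb",
                         openhook=fileinput.hook_compressed) as stream:
        for line in stream:
            yield line
```

Calling `iter_lines(["a.txt", "b.gz"])` would walk both files in order as a single stream of lines.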

answered 2017-06-10T23:02:31.627
