3

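I am converting my backup script from shell to Python. One of the things my old script did was check the integrity of the tar file it had created by running gzip -t on it.

This seems to be a bit tricky in Python.

It looks like the only way to do this is to read every compressed TarInfo object in the tarfile.

Is there a way to check the integrity of a tar file without extracting it to disk or holding it (in its entirety) in memory?

The good people on #python on freenode suggested that I read each TarInfo object block by block, discarding each block as it is read.

I must admit I have no idea how to do that, as I have only just started with Python.

Imagine I have a 30GB tarfile containing files ranging from 1kb to 10GB...

This is the solution I have started writing: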

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

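This code is far from finished. I do not dare run it on a huge 30GB tar archive, because at some point check would be a 10+GB object (if the archive contains files that large).

Bonus: I tried corrupting zero.tar.gz by hand (hex editor, changing a few bytes in the middle of the file). The first except does not catch the IOError... here is the output: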

Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L

3 Answers

3

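Just a minor improvement on Aya's answer to make things a little more idiomatic (although I am removing some of the error checking to make the mechanics more visible):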

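import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        if not member.isfile():
            continue  # extractfile() returns None for directories and similar members
        with tardude.extractfile(member.name) as target:
            # Read and discard the member one block at a time.
            for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
                pass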

This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that using with restricts this to Python 2.7+.
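In case the two-argument form of iter() is unfamiliar, here is a minimal sketch of what it does, using an in-memory io.BytesIO as a stand-in for the file object returned by extractfile(). The sample data and the tiny BLOCK_SIZE are made up for illustration.

```python
import io

BLOCK_SIZE = 4  # deliberately tiny so the chunking is visible

# Hypothetical stand-in for the object returned by extractfile().
source = io.BytesIO(b"0123456789")

# iter(callable, sentinel) calls the callable repeatedly and stops as soon
# as it returns the sentinel; read() returns b"" at end of file.
chunks = list(iter(lambda: source.read(BLOCK_SIZE), b""))

print(chunks)  # [b'0123', b'4567', b'89']
```

In the integrity check the chunks are simply thrown away instead of being collected into a list, so only one block is alive at a time.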

answered 2015-08-31T13:48:41.553
2

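I tried corrupting zero.tar.gz by hand (hex editor, changing a few bytes in the middle of the file). The first except does not catch the IOError...

If you look at the traceback, you will see that it is raised by the call to tardude.getmembers(), so you need something like...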

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

As for the original question, you are almost there. You just need to read the data from your check object, like...

BLOCK_SIZE = 1024

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:            
        check = tardude.extractfile(member_info.name)
        while 1:
            data = check.read(BLOCK_SIZE)
            if not data:
                break
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()

...which should ensure that you never hold more than BLOCK_SIZE bytes of file data in memory at a time.
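For anyone on Python 3, the same idea can be sketched roughly as follows. This is only a sketch under a few assumptions, not a drop-in replacement: the function name verify_tar_gz and the 1 MiB block size are made up, the archive is assumed to be gzip-compressed, and the set of exceptions caught is my best guess at what a damaged archive can raise. The gzip stream is opened explicitly and drained after the last member, because tarfile may stop reading at the end-of-archive marker, before the gzip CRC trailer has been checked.

```python
import gzip
import tarfile
import zlib

BLOCK_SIZE = 1024 * 1024  # bytes read per call; the value is arbitrary


def verify_tar_gz(path, block_size=BLOCK_SIZE):
    """Return True if the whole archive at `path` can be read back."""
    try:
        with gzip.open(path, "rb") as gz:
            # "r|" treats gz as a forward-only stream of tar blocks.
            with tarfile.open(fileobj=gz, mode="r|") as tar:
                for member in tar:
                    if not member.isfile():
                        continue  # directories, links etc. are skipped
                    with tar.extractfile(member) as data:
                        while data.read(block_size):
                            pass  # discard each chunk
            # Drain whatever follows the last member so that the gzip
            # CRC and length trailer is checked as well.
            while gz.read(block_size):
                pass
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
    return True
```

Calling verify_tar_gz("zero.tar.gz") then gives a simple True or False. A missing file also comes back as False here, since FileNotFoundError is an OSError; split that case out if you need to tell the two apart.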

Also, you should try to avoid using...

try:
    do_something()
except:
    do_something_else()

...because it will mask unexpected exceptions. Try to catch only the exceptions you actually intend to handle, like...

try:
    do_something()
except IOError:
    do_something_else()

...otherwise you will find it much harder to detect bugs in your code.
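To make the difference concrete, here is a small made-up illustration. All three function names are hypothetical, and the typo is deliberate. The bare except reports a plain programming error as a corrupt file, while the narrower handler lets the bug surface.

```python
def read_member():
    # Stand-in for real work that contains a bug: `blok_size` is a typo,
    # so calling this raises NameError rather than an I/O error.
    return blok_size


def check_with_bare_except():
    try:
        read_member()
    except:  # catches everything, including the NameError
        return "corrupt"
    return "ok"


def check_with_specific_except():
    try:
        read_member()
    except IOError:  # an alias of OSError on Python 3
        return "corrupt"
    return "ok"
```

check_with_bare_except() quietly returns "corrupt", whereas check_with_specific_except() raises the NameError, which points straight at the typo.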

answered 2013-04-15T11:23:40.460
1

You can use the subprocess module to call gzip -t on the file...

from subprocess import call
import os

with open(os.devnull, 'w') as bb:
    result = call(['gzip', '-t', "zero.tar.gz"], stdout=bb, stderr=bb)

If result is not 0, something is wrong. You may want to check that gzip is available, though. I wrote a utility function for that:

import subprocess
import sys
import os

def checkfor(args, rv = 0):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args  -- string or list of strings of commands. A single string may
             not contain spaces.
    rv    -- expected return value from invoking the command.
    """
    if isinstance(args, str):
        if ' ' in args:
            raise ValueError('no spaces in single command allowed')
        args = [args]
    try:
        with open(os.devnull, 'w') as bb:
            rc = subprocess.call(args, stdout=bb, stderr=bb)
        if rc != rv:
            raise OSError
    except OSError as oops:
        outs = "Required program '{}' not found: {}."
        print(outs.format(args[0], oops.strerror))
        sys.exit(1)
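On Python 3.3 or later, a variation on the above is to look gzip up with shutil.which and fall back to the standard gzip module when the program is not installed. The sketch below assumes exactly that. The name gzip_test and the block size are made up, and the fallback branch is my addition, not something the code above does.

```python
import gzip
import shutil
import subprocess
import zlib


def gzip_test(path, block_size=1024 * 1024):
    """Return True if `path` passes a gzip integrity test."""
    if shutil.which("gzip") is not None:
        # gzip is on PATH: run the same test the shell script used.
        rc = subprocess.call(["gzip", "-t", path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        return rc == 0
    # Fallback: decompress in chunks and discard the output; the gzip
    # module checks the CRC when it reaches the end of the stream.
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(block_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Either branch only tests the gzip layer, just as gzip -t does. It says nothing about whether the tar structure inside is sane.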
answered 2013-04-15T11:21:52.140