python - How do I protect myself from a gzip or bzip2 bomb?

Question

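This is related to the question about zip bombs, but with gzip or bzip2 compression in mind, e.g. a web service accepting .tar.gz files.

Python provides a handy tarfile module that is convenient to use, but it does not seem to provide protection against zip bombs.

In Python code using the tarfile module, what would be the most elegant way to detect zip bombs, preferably without duplicating too much logic (e.g. the transparent decompression support) from the tarfile module?

And, just to make it a bit less simple: no real files are involved; the input is a file-like object (provided by the web framework, representing the file a user uploaded).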

score 13 · Accepted Answer

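You could use the resource module to limit the resources available to your process and its children.

If you need to decompress in memory, you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g. using a context manager to automatically restore it to the previous value: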

import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit)) # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit)) # restore

with limit(1 << 30):  # 1 GiB
    pass  # replace with the work that might try to consume all memory

If the limit is reached, MemoryError is raised.

score 6 · Accepted Answer

This will determine the uncompressed size of the gzip stream, while using limited memory:

#!/usr/bin/env python3
import sys
import zlib

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)  # 15+16: expect a gzip header and trailer
total = 0
while True:
    buf = z.unconsumed_tail
    if buf == b"":
        buf = f.read(1024)
        if buf == b"":
            break
    got = z.decompress(buf, 4096)  # return at most 4096 bytes per call
    if got == b"":
        break
    total += len(got)
print(total)
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")

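It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.

The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data. In gzip.py, it reads 1024 compressed bytes at a time. So you can use gzip.py if you are OK with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here calls the decompress method of a zlib decompression object with its second argument, which limits the amount of uncompressed data returned. gzip.py does not.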
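Since the question starts from a file-like object rather than a file on disk, the same bounded-decompression idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the names gunzip_bounded and DecompressedTooLarge are made up for illustration, it handles a single gzip member, and it keeps the accepted output in memory.

```python
import gzip
import io
import zlib


class DecompressedTooLarge(Exception):
    """Hypothetical exception: the output would exceed the allowed size."""


def gunzip_bounded(fileobj, max_size, chunk=16 * 1024):
    """Decompress one gzip member from fileobj, refusing to exceed max_size bytes."""
    z = zlib.decompressobj(15 + 16)  # 15+16: expect a gzip header and trailer
    pieces = []
    total = 0
    while not z.eof:
        # Finish input left over from the previous call before reading more.
        buf = z.unconsumed_tail or fileobj.read(chunk)
        if not buf:
            raise EOFError("truncated gzip stream")
        got = z.decompress(buf, chunk)  # at most `chunk` output bytes per call
        total += len(got)
        if total > max_size:
            raise DecompressedTooLarge("more than %d bytes" % max_size)
        pieces.append(got)
    return b"".join(pieces)


# Usage: an in-memory upload stands in for the web framework's file object.
upload = io.BytesIO(gzip.compress(b"a" * 100000))
payload = gunzip_bounded(upload, max_size=1 << 20)
```

Because each call produces at most `chunk` bytes, a bomb is rejected after roughly max_size bytes of output, no matter how small the compressed input is.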

This will accurately determine the total size of the extracted tar entries by decoding the tar format:

#!/usr/bin/env python3

import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
       stream.  This only decompresses as much as necessary, in order to
       avoid excessive memory usage for highly compressed input.
    """
    blk = b""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == b"":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == b"":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)  # 15+16: expect a gzip header and trailer
total = 0
left = 0  # number of 512-byte data blocks still to skip for the current entry
while True:
    blk = decompn(f, z, 512)
    if len(blk) < 512:
        break
    if left == 0:
        if blk == b"\0" * 512:
            continue  # all-zero block (end-of-archive padding)
        typeflag = blk[156:157]
        if typeflag in (b"1", b"2", b"3", b"4", b"5", b"6"):
            continue  # links, devices, directories and FIFOs carry no data
        if blk[124] == 0x80:
            # size stored in base-256 (binary) form
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += blk[i]
        else:
            # size stored as an octal string
            size = int(blk[124:136].split()[0].split(b"\0")[0], 8)
        if typeflag not in (b"x", b"g", b"X", b"L", b"K"):
            total += size  # extended headers and long names are not file data
        left = (size + 511) // 512
    else:
        left -= 1
print(total)
if blk != b"":
    print("warning: partial final block")
if left != 0:
    print("warning: tar file ended in the middle of an entry")
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")

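You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.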
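One possible variant, sketched here with the tarfile module's streaming mode instead of hand-decoding headers, adds up the sizes declared in the headers and stops as soon as a budget is exceeded. The names check_tar_gz and ArchiveTooLarge and the default limits are assumptions made for illustration. As far as I understand, in "r|gz" mode tarfile still reads through each member's data to reach the next header, but it does so in blocks, so memory use stays modest.

```python
import io
import tarfile


class ArchiveTooLarge(Exception):
    """Hypothetical exception: the archive declares more data than allowed."""


def check_tar_gz(fileobj, max_total, max_members=10000):
    """Sum the declared sizes of regular files in a .tar.gz stream."""
    total = 0
    count = 0
    # "r|gz" reads the archive strictly sequentially, so fileobj need not seek.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            count += 1
            if member.isfile():
                total += member.size  # size taken from the tar header
            if total > max_total or count > max_members:
                raise ArchiveTooLarge("archive exceeds the configured limits")
    return total


def make_sample(sizes):
    """Build a small .tar.gz in memory so the sketch can be tried out."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w:gz") as tar:
        for i, size in enumerate(sizes):
            info = tarfile.TarInfo("file%d.bin" % i)
            info.size = size
            tar.addfile(info, io.BytesIO(b"\0" * size))
    raw.seek(0)
    return raw


declared = check_tar_gz(make_sample([1000, 2000]), max_total=1 << 20)
```

This only checks what the headers declare; actual extraction should still be done with its own limits. It also does not cover .tar.bz2 input.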

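As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe against bz2 bombs that consume too much memory. The bz2.decompress function does not offer a second argument the way the zlib decompression object's decompress method does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib, due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input, as can be done with zlib even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.

I looked in the _bz2module.c in 3.3 to see if there is an undocumented way to use it to avoid this problem. There is no way around it. The decompress function in there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.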
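For readers on newer interpreters: starting with Python 3.5, bz2.BZ2Decompressor.decompress accepts a max_length argument and exposes a needs_input attribute, which makes a bounded loop possible. The following is a minimal sketch under that assumption; the names bz2_size_bounded and Bz2TooLarge are invented for illustration, and only a single bzip2 stream is handled.

```python
import bz2
import io


class Bz2TooLarge(Exception):
    """Hypothetical exception: the output would exceed the allowed size."""


def bz2_size_bounded(fileobj, limit, chunk=16 * 1024):
    """Return the uncompressed size of one bzip2 stream, or raise past limit."""
    d = bz2.BZ2Decompressor()
    total = 0
    while not d.eof:
        if d.needs_input:
            raw = fileobj.read(chunk)
            if not raw:
                raise EOFError("truncated bzip2 stream")
        else:
            raw = b""  # output is still pending from input already supplied
        out = d.decompress(raw, max_length=chunk)  # requires Python 3.5+
        total += len(out)
        if total > limit:
            raise Bz2TooLarge("more than %d bytes" % limit)
    return total


# Usage: one million zero bytes compress to a very small bzip2 stream.
bomb = io.BytesIO(bz2.compress(b"\0" * 1000000))
size = bz2_size_bounded(bomb, limit=10 * 1000000)
```

On Python 3.3 and 3.4 this call signature is not available, so the caveats in the answer above still apply there.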

score 3 · Accepted Answer

If you develop for Linux, you can run the decompression in a separate process and use ulimit to limit the memory usage.

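import subprocess
# LIMIT (virtual memory cap, in KiB) and FILE are placeholders you must define.
# shell=True is needed because ulimit is a shell builtin.
subprocess.Popen("ulimit -v %d; ./decompression_script.py %s" % (LIMIT, FILE), shell=True)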

Keep in mind that decompression_script.py should decompress the whole file in memory before writing it to disk.
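The same idea can be sketched without a shell by setting the limit in the child through the resource module. This is an illustration under assumptions, not a drop-in replacement: run_with_memory_limit is a made-up helper name, it is Unix-only, it assumes the current hard limit is not below the requested value, and preexec_fn is documented as unsafe in multithreaded programs.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(args, limit_bytes):
    """Run args in a child process whose address space is capped (Unix only)."""
    def set_limit():
        # Runs in the child between fork and exec; only the soft limit changes.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return subprocess.run(args, preexec_fn=set_limit, capture_output=True)


# Hypothetical stand-in for decompression_script.py: it tries to allocate
# 2 GiB while capped at 1 GiB, so it is expected to exit with an error.
hungry = [sys.executable, "-c", "x = bytearray(2 << 30)"]
result = run_with_memory_limit(hungry, 1 << 30)
failed = result.returncode != 0
```

The parent process keeps its own limits, and a failed child can simply be reported as a rejected upload.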

score 3 · Accepted Answer

I guess the answer is: there is no easy, ready-made solution. Here is what I use now:

class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zipbombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10*1024*1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        import bz2
        import gzip
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = b""
        self.format = "plain"

        # Sniff the compression format from the first two bytes.
        magic = self.fileobj.read(2)
        if magic == b'\037\213':
            self.format = "gzip"
            self.gzipobj = gzip.GzipFile(fileobj=self.fileobj, mode='r')
        elif magic == b'BZ':
            raise IOError("bzip2 support in SafeUncompressor disabled, as self.bz2obj.decompress is not safe")
            # Unreachable until bzip2 support is re-enabled.
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()
        self.fileobj.seek(0)


    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == 'gzip':
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == 'bz2':
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent.
                # so disable bzip support until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)

            # Abort as soon as the uncompressed total exceeds the cap.
            if self.pos + x > self.maxsize:
                self.buf = b""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge("Compressed file too large")
        self.buf = b"".join(b)

        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError("SafeUncompressor only supports whence=0")
        if pos < self.pos:
            self.init()  # seeking backwards restarts from the beginning
        self.read(pos - self.pos)

    def tell(self):
        return self.pos

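It does not work well for bzip2, so that part of the code is disabled. The reason is that bz2.BZ2Decompressor.decompress can already produce an unwanted large chunk of data.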

score 0 · Accepted Answer

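I also need to handle zip bombs in uploaded zip files.

I do this by creating a fixed-size tmpfs and unzipping to that. If the extracted data is too large, the tmpfs will run out of space and give an error.

Here are the Linux commands to set up a 200M tmpfs to unzip to.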

sudo mkdir -p /mnt/ziptmpfs
echo 'tmpfs   /mnt/ziptmpfs         tmpfs   rw,nodev,nosuid,size=200M          0  0' | sudo tee -a /etc/fstab
