python - Reducing the size of cPickle objects

Question

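I am running code that creates large objects, containing multiple user-defined classes, which I must then serialize for later use. From what I can tell, only pickling is versatile enough for my requirements. I have been using cPickle to store them, but the objects it generates are approximately 40G in size, from code that runs in 500 MB of memory. Speed of serialization is not an issue, but the size of the object is. Are there any tips or alternative approaches I can use to make the pickles smaller?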

score 58 · Accepted Answer

You can combine your cPickle dump call with gzip compression:

import cPickle
import gzip

def save_zipped_pickle(obj, filename, protocol=-1):
    with gzip.open(filename, 'wb') as f:
        cPickle.dump(obj, f, protocol)

And to reload a zipped pickled object:

def load_zipped_pickle(filename):
    with gzip.open(filename, 'rb') as f:
        loaded_object = cPickle.load(f)
        return loaded_object
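
For readers on Python 3, where the C implementation is used automatically by the plain `pickle` module and `cPickle` no longer exists, a round trip through the same two helpers might look like the sketch below. The sample dictionary and the temporary file name are assumptions made for illustration only.

```python
import gzip
import os
import pickle
import tempfile


def save_zipped_pickle(obj, filename, protocol=-1):
    # protocol=-1 selects the highest protocol the running interpreter supports
    with gzip.open(filename, "wb") as f:
        pickle.dump(obj, f, protocol)


def load_zipped_pickle(filename):
    with gzip.open(filename, "rb") as f:
        return pickle.load(f)


# Hypothetical sample data, not taken from the question
data = {"name": "example", "values": list(range(1000))}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl.gz")
    save_zipped_pickle(data, path)
    restored = load_zipped_pickle(path)

print(restored == data)
```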

score 48 · Accepted Answer

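If you must use pickle and no other method of serialization works for you, you can always pipe the pickle through bzip2. The only problem is that bzip2 is a little slow... gzip should be faster, but the file size is almost 2x bigger: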

In [1]: class Test(object):
            def __init__(self):
                self.x = 3841984789317471348934788731984731749374
                self.y = 'kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf'
        l = [Test() for i in range(1000000)]

In [2]: import cPickle as pickle          
        with open('test.pickle', 'wb') as f:
            pickle.dump(l, f)
        !ls -lh test.pickle
-rw-r--r--  1 viktor  staff    88M Aug 27 22:45 test.pickle

In [3]: import bz2
        import cPickle as pickle
        with bz2.BZ2File('test.pbz2', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pbz2
-rw-r--r--  1 viktor  staff   2.3M Aug 27 22:47 test.pbz2

In [4]: import gzip
        import cPickle as pickle
        with gzip.GzipFile('test.pgz', 'w') as f:
            pickle.dump(l, f)
        !ls -lh test.pgz
-rw-r--r--  1 viktor  staff   4.8M Aug 27 22:51 test.pgz

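So we see that the file is almost 40x smaller with bzip2 (88M down to 2.3M) and almost 20x smaller with gzip (88M down to 4.8M). gzip is also pretty close in performance to raw cPickle, as you can see: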

cPickle : best of 3: 18.9 s per loop
bzip2   : best of 3: 54.6 s per loop
gzip    : best of 3: 24.4 s per loop
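
The same comparison can be tried in memory without writing files. This is a minimal Python 3 sketch, assuming plain dicts as a stand-in for the `Test` instances and a smaller list of 10,000 records; the ratios it prints will differ from the figures above, which came from Python 2 with the default protocol.

```python
import bz2
import gzip
import pickle

# Stand-in records (an assumption): plain dicts instead of Test instances
base = 3841984789317471348934788731984731749374
text = 'kdjsaflkjda;sjfkdjsf;klsdjakfjdafjdskfl;adsjfl;dasjf;ljfdlf'
items = [{"x": base + i, "y": text} for i in range(10000)]

# Pickle once, then compress the resulting bytes with each codec
raw = pickle.dumps(items, protocol=pickle.HIGHEST_PROTOCOL)
gz = gzip.compress(raw)
bz = bz2.compress(raw)

for name, blob in (("raw", raw), ("gzip", gz), ("bzip2", bz)):
    print(name, len(blob), "bytes")

# Decompressing and unpickling gives back the original list
restored = pickle.loads(bz2.decompress(bz))
```

Which codec wins on size and speed depends on the data, so it is worth measuring on a representative sample of your own objects.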

score 3 · Accepted Answer

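You might want to use a more efficient pickling protocol.

In Python 2, which is where cPickle lives, there are three pickle protocols:

Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is the old binary format, which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.

Furthermore, the default is protocol 0, the least efficient one:

If a protocol is not specified, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version available will be used.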
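
The rule about negative values can be checked directly. The sketch below targets Python 3, which is an assumption on my part: there the protocol numbers go higher and the default is no longer protocol 0, but a negative value still resolves to `HIGHEST_PROTOCOL`. The sample list is hypothetical.

```python
import pickle

# Hypothetical sample object
obj = list(range(100))

# A negative protocol and HIGHEST_PROTOCOL should select the same format,
# so the two byte strings are expected to be identical
with_negative = pickle.dumps(obj, protocol=-1)
with_highest = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
same = with_negative == with_highest

print("highest:", pickle.HIGHEST_PROTOCOL)
print("default:", pickle.DEFAULT_PROTOCOL)
print("negative selects highest:", same)
```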

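Let's check the difference in size between using the latest protocol (currently protocol 2, the most efficient one) and using protocol 0 (the default), for an arbitrary example. Note that I use protocol=-1 here to make sure we are always using the latest protocol, and that I import cPickle to make sure we are using the faster C implementation: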

import numpy
from sys import getsizeof
import cPickle as pickle

# Create list of 10 arrays of size 100x100
a = [numpy.random.random((100, 100)) for _ in xrange(10)]

# Pickle to a string in two ways
str_old = pickle.dumps(a, protocol=0)
str_new = pickle.dumps(a, protocol=-1)

# Measure size of strings
size_old = getsizeof(str_old)
size_new = getsizeof(str_new)

# Print size (in kilobytes) using old, using new, and the ratio
print size_old / 1024.0, size_new / 1024.0, size_old / float(size_new)

The printout I get is:

2172.27246094 781.703125 2.77889698975

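This shows that pickling with the old protocol used 2172KB and pickling with the new protocol used 782KB, a difference of a factor of 2.8. Note that this factor is specific to this example; your results may vary depending on the objects you are pickling.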
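
A variant of the measurement for Python 3 is sketched below, under two assumptions: nested lists of floats stand in for the numpy arrays so that only the standard library is needed, and `len()` replaces `getsizeof`. `getsizeof` also counts the string object's own header, whereas `len()` reports just the number of serialized bytes, which is what ends up on disk.

```python
import pickle
import random

random.seed(0)  # fixed seed so repeated runs pickle the same data

# 10 lists of 10,000 floats each, standing in for ten 100x100 arrays
a = [[random.random() for _ in range(100 * 100)] for _ in range(10)]

# Pickle to bytes with the oldest and the newest protocol
bytes_old = pickle.dumps(a, protocol=0)
bytes_new = pickle.dumps(a, protocol=-1)

# len() counts only the serialized payload
size_old = len(bytes_old)
size_new = len(bytes_new)

# Size in kilobytes using old, using new, and the ratio
print(size_old / 1024.0, size_new / 1024.0, size_old / float(size_new))
```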
