I am creating a very large array. I don't want to hold the whole array in memory; instead, I want to write it out to a file, in a format I can import again later.

I would use pickle, but it seems that pickle is meant for serializing a structure that is already complete.

In the example below, I need a way for the out variable to be a file rather than an in-memory object:
out = []
for x in y:
    z = []
    # get lots of data into z
    out.append(z)
streaming-pickle lets you save/load a sequence of Python data structures to/from disk in a streaming (incremental) way, so it uses far less memory than regular pickle.

It is really just a single file containing three short functions. I've inlined it here together with an example:
try:
    from cPickle import dumps, loads  # Python 2
except ImportError:
    from pickle import dumps, loads   # Python 3

def s_dump(iterable_to_pickle, file_obj):
    """ dump contents of an iterable iterable_to_pickle to file_obj, a file
    opened in write mode """
    for elt in iterable_to_pickle:
        s_dump_elt(elt, file_obj)

def s_dump_elt(elt_to_pickle, file_obj):
    """ dumps one element to file_obj, a file opened in write mode """
    pickled_elt_str = dumps(elt_to_pickle)
    file_obj.write(pickled_elt_str)
    # record separator is a blank line
    # (since pickled_elt_str might contain its own newlines)
    file_obj.write('\n\n')

def s_load(file_obj):
    """ load contents from file_obj, returning a generator that yields one
    element at a time """
    cur_elt = []
    for line in file_obj:
        cur_elt.append(line)
        if line == '\n':
            # a blank line marks the end of one pickled record
            pickled_elt_str = ''.join(cur_elt)
            elt = loads(pickled_elt_str)
            cur_elt = []
            yield elt
Here is how you can use it:
from __future__ import print_function
import os

if __name__ == '__main__':
    if os.path.exists('obj.serialized'):
        # load a file 'obj.serialized' from disk and
        # spool through the iterable
        with open('obj.serialized', 'r') as handle:
            _generator = s_load(handle)
            for element in _generator:
                print(element)
    else:
        # or create it first, otherwise
        with open('obj.serialized', 'w') as handle:
            for i in xrange(100000):
                s_dump_elt({'i': i}, handle)
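To tie this back to the question: since s_dump takes any iterable, you can turn the loop body into a generator and stream each z straight to disk instead of appending to out. A minimal sketch, reusing y and the data-gathering step from the question (generate_rows is just an illustrative name; note that, like the example above, this is Python 2 style, and under Python 3 the file would need binary mode and a b'\n\n' separator, since pickle then produces bytes):

def generate_rows(y):
    # hypothetical wrapper around the question's loop
    for x in y:
        z = []
        # get lots of data into z
        yield z  # hand over one row at a time

with open('obj.serialized', 'w') as handle:
    s_dump(generate_rows(y), handle)  # only one z is in memory at any moment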
I could imagine pickling each item to a string with a length indicator in front:
import os
import struct
import pickle  # or cPickle

def loader(inf):
    while True:
        s = inf.read(4)
        if not s:
            return
        # each record is preceded by a 4-byte big-endian unsigned length
        length, = struct.unpack(">L", s)
        data = inf.read(length)
        yield pickle.loads(data)

if __name__ == '__main__':
    if os.path.exists('dumptest'):
        # load file
        with open('dumptest', 'rb') as inf:
            for element in loader(inf):
                print element
    else:
        # or create it first, otherwise
        with open('dumptest', 'wb') as outf:
            for i in xrange(100000):
                dump = pickle.dumps({'i': i}, protocol=-1)  # or whatever protocol you want...
                lenstr = struct.pack(">L", len(dump))
                outf.write(lenstr + dump)
This buffers no more data than actually needed, keeps the items separate from one another, and is compatible with all pickle protocols.
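For what it's worth, plain pickle can also do the framing by itself: each call to pickle.dump appends one self-delimiting record to the file, and repeated pickle.load calls read them back one at a time, raising EOFError once the input is exhausted. A minimal Python 3 sketch in the same spirit as the examples above (dump_stream, load_stream and the file name dumptest2 are just illustrative):

import pickle

def dump_stream(items, path):
    """ write each item as its own pickle record; the format is
    self-delimiting, so no separator or length header is needed """
    with open(path, 'wb') as f:
        for item in items:
            pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_stream(path):
    """ yield one unpickled object at a time until end of file """
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

dump_stream(({'i': i} for i in range(100000)), 'dumptest2')
for element in load_stream('dumptest2'):
    pass  # process each element without building the whole list in memory

Like the length-header variant, this holds only one record in memory at a time and works with any pickle protocol.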