python - Creating a working data structure in a file

Question

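I am building a very large array. Rather than holding this array in memory, I want to be able to write it to a file as I go. The file needs to be in a format that I can import again later.

I would use pickle, but pickle seems to be meant for data structures that are already complete.

In the example below, I need a way for the out variable to be a file rather than an object held in memory: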

out = []
for x in y:
    z = []
    #get lots of data into z
    out.append(z)

score 2 · Accepted Answer

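Take a look at streaming-pickle.

streaming-pickle lets you save/load a sequence of Python data structures to/from disk in a streaming (incremental) manner, so it uses far less memory than regular pickle.

It is really just a single file with three short functions. I have added a snippet with an example: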

try:
    # Python 2: prefer the faster C implementation when it is available
    from cPickle import dumps, loads
except ImportError:
    from pickle import dumps, loads


def s_dump(iterable_to_pickle, file_obj):
    """ dump contents of an iterable iterable_to_pickle to file_obj, a file
    opened in write mode """
    for elt in iterable_to_pickle:
        s_dump_elt(elt, file_obj)

def s_dump_elt(elt_to_pickle, file_obj):
    """ dumps one element to file_obj, a file opened in write mode """
    pickled_elt_str = dumps(elt_to_pickle)
    file_obj.write(pickled_elt_str)
    # record separator is a blank line
    # (since pickled_elt_str might contain its own newlines)
    file_obj.write('\n\n')

def s_load(file_obj):
    """ load contents from file_obj, returning a generator that yields one
        element at a time """
    cur_elt = []
    for line in file_obj:
        cur_elt.append(line)

        if line == '\n':
            pickled_elt_str = ''.join(cur_elt)
            elt = loads(pickled_elt_str)
            cur_elt = []
            yield elt

Here is how you could use it:

from __future__ import print_function
import os
import sys

if __name__ == '__main__':
    if os.path.exists('obj.serialized'):
        # load a file 'obj.serialized' from disk and 
        # spool through iterable      
        with open('obj.serialized', 'r') as handle:
            _generator = s_load(handle)
            for element in _generator:
                print(element)
    else:
        # or create it first, otherwise
        with open('obj.serialized', 'w') as handle:
            for i in xrange(100000):
                s_dump_elt({'i' : i}, handle)
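
The snippet above targets Python 2 (cPickle, xrange, text-mode files). On Python 3 the same incremental idea can be sketched without a separator, because pickle.dump can be called repeatedly on one binary file and pickle.load reads back exactly one object per call, raising EOFError once the file is exhausted. The helper names append_records and iter_records and the file name are illustrative assumptions, not part of streaming-pickle:

```python
import os
import pickle
import tempfile


def append_records(path, records):
    """Append each record to the file as its own self-contained pickle."""
    with open(path, "ab") as handle:
        for record in records:
            pickle.dump(record, handle, protocol=pickle.HIGHEST_PROTOCOL)


def iter_records(path):
    """Yield records one at a time until the end of the file is reached."""
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:
                return


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "out.pickles")
    # Write in two batches to show that the file can keep growing.
    append_records(path, ({"i": i} for i in range(3)))
    append_records(path, ({"i": i} for i in range(3, 5)))
    for record in iter_records(path):
        print(record)
```

Only one record is held in memory at a time on both the writing and the reading side, which is the property the question asks for.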

score 1 · Accepted Answer

HDF5, perhaps? It is fairly widely supported and lets you append to an existing dataset.

Answered 2012-12-12T15:00:17.993

score 0 · Accepted Answer

I could imagine pickling each item to a string and prefixing it with a length indicator:

import os
import struct
import pickle # or cPickle

def loader(inf):
    while True:
        s = inf.read(4)
        if not s: return
        length, = struct.unpack(">L", s)
        data = inf.read(length)
        yield pickle.loads(data)

if __name__ == '__main__':
    if os.path.exists('dumptest'):
        # load file
        with open('dumptest', 'rb') as inf:
            for element in loader(inf):
                print(element)  # parenthesized form works on Python 2 and 3
    else:
        # or create it first, otherwise
        with open('dumptest', 'wb') as outf:
            for i in range(100000):  # range instead of xrange so this also runs on Python 3
                dump = pickle.dumps({'i' : i}, protocol=-1) # or whatever you want as protocol...
                lenstr = struct.pack(">L", len(dump))
                outf.write(lenstr + dump)

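This does not buffer any more data than is actually needed, keeps the items separate from each other, and is compatible with all pickle protocols.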
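
To make the framing explicit, here is a minimal sketch of the same length-prefix scheme with a matching writer helper and a check for truncated files. The names write_record, read_records and HEADER are hypothetical, and an in-memory io.BytesIO buffer stands in for a file opened in binary mode:

```python
import io
import pickle
import struct

HEADER = struct.Struct(">L")  # 4-byte big-endian unsigned length


def write_record(outf, obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Write one object as <length><pickle bytes>."""
    payload = pickle.dumps(obj, protocol=protocol)
    outf.write(HEADER.pack(len(payload)) + payload)


def read_records(inf):
    """Yield objects one by one; stop cleanly at end of file."""
    while True:
        header = inf.read(HEADER.size)
        if not header:
            return
        if len(header) < HEADER.size:
            raise EOFError("truncated length header")
        (length,) = HEADER.unpack(header)
        payload = inf.read(length)
        if len(payload) < length:
            raise EOFError("truncated record")
        yield pickle.loads(payload)


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a file opened in binary mode
    for z in ([1, 2, 3], ["a", "b"], []):
        write_record(buf, z)
    buf.seek(0)
    print(list(read_records(buf)))
```

Note that the ">L" header limits a single pickled record to 2**32 - 1 bytes; a wider format such as ">Q" would lift that limit at the cost of four extra bytes per record.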
