I have a large numpy array A of shape (2_000_000, 2000) and dtype float64, which takes 32 GB.

(Or might it be easier to serialize the same data split into 10 arrays of shape (200_000, 2000)?)

How can we serialize it to disk so that we can quickly read any random part of the data?

More precisely, I need to be able to read ten thousand windows of shape (16, 2000) from A, each starting at a random index i:

import random
import numpy as np

L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]   # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.stack(L)          # shape (10_000, 16, 2000) of float64, ie: ~2.4 GB

Assume I have only 8 GB of RAM available for this task; loading the entire 32 GB of A into RAM is completely out of the question.

How can we read such windows from a numpy array serialized on disk? (.h5 format or any other)

Note: the fact that reads happen at random starting indices is important.

2 Answers

This example shows how to use an HDF5 file for the process you describe.

First, create an HDF5 file with a dataset of shape (2_000_000, 2000) and dtype=float64. I used variables for the dimensions so you can modify them.

import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5','w') as h5f:
    # dtype must be given explicitly: h5py defaults to float32, not float64
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')

    # write in 1_000 blocks of 2_000 rows each to keep RAM use small
    incr = 1_000
    a0 = h5_a0//incr
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0, h5_a1)
        dset[i*a0:i*a0+a0, :] = arr
    print(dset[-1,0:10])  # quick dataset check of values in last row
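
Since the reads start at random row offsets, one creation-time detail may be worth considering: h5py lets you choose the dataset's chunk layout, and a chunk shape aligned with the window height limits how much data each read touches. A minimal sketch of the same creation call with an explicit layout; the 16-row chunk shape is an assumption here, not something benchmarked:

import h5py

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5','w') as h5f:
    # chunks of 16 rows x full width: any (16, 2000) window then
    # overlaps at most 2 chunks on disk (assumed, untuned layout)
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1),
                              dtype='float64', chunks=(16, h5_a1))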

Next, open the file in read mode, read 10_000 random slices of shape (16, 2_000), and append each one to list L. Finally, convert the list to array WINS. Note that np.concatenate returns an array with 2 axes by default; you need .reshape() if you want the 3 axes from your question (the reshape is also shown).

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window of shape (16, 2000) starting at random index ir
        L.append(window)
    WINS = np.concatenate(L)   # shape (160_000, 2_000) of float64
    print(WINS.shape, WINS.dtype)
    WINS = np.concatenate(L).reshape(10_000, 16, ds1)   # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)

The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: in both list L and array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read each random slice directly into a preallocated array. Doing this simplifies the code and reduces the memory footprint. The method is shown below (WINS2 is a 2-axis array, and WINS3 is a 3-axis array).

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16, ds1))   # 2-axis result
    WINS3 = np.empty((10_000, 16, ds1))  # 3-axis result
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]       # read each random slice from disk only once
        WINS2[i*16:(i+1)*16, :] = window
        WINS3[i, :, :] = window
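
If even the temporary array created by each dset[ir:ir+16, :] read matters, h5py also provides Dataset.read_direct(), which copies from the file into a preallocated array in place. A sketch of the 3-axis variant under that approach; this is an adaptation, not part of the tested code above:

import numpy as np
import h5py
import random

with h5py.File('SO_68206763.h5','r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape
    WINS3 = np.empty((10_000, 16, ds1), dtype=dset.dtype)
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        # copy straight from the file into WINS3[i], no temporary slice
        dset.read_direct(WINS3, source_sel=np.s_[ir:ir+16, :],
                         dest_sel=np.s_[i, :, :])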

answered 2021-07-01 at 20:09

An alternative solution to h5py datasets that I tried and that works is using np.memmap, as suggested in @RyanPepper's comment.

Write the data as binary

import numpy as np

# write 1000 blocks of 10 rows each -> a (10_000, 2000) float32 array on disk
with open('a.bin', 'wb') as A:
    for f in range(1000):
        x = np.random.randn(10, 2000).astype('float32')
        A.write(x.tobytes())
        A.flush()
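
If you would rather have the file carry its own dtype and shape, numpy can write the same data as a standard .npy file through np.lib.format.open_memmap; a sketch under that assumption (the a.npy file name is illustrative):

import numpy as np

# open_memmap writes a real .npy header, so dtype and shape live in the file
out = np.lib.format.open_memmap('a.npy', mode='w+',
                                dtype='float32', shape=(10_000, 2000))
for f in range(1000):
    out[f*10:(f+1)*10, :] = np.random.randn(10, 2000).astype('float32')
out.flush()

# later, np.load('a.npy', mmap_mode='r') restores the memmap without a manual reshape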

Open later as memmap

# raw binary stores no metadata, so dtype and shape must be restated here
A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
print(A.shape)  # (10000, 2000)
print(A[1234:1234+16, :])  # a (16, 2000) window
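
The question's gathering loop then works unchanged against the memmap, and only the pages actually touched are read from disk, which keeps the 8 GB budget intact. A sketch against the small demo file above (sizes are the demo's, not the full 32 GB array's):

import numpy as np
import random

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
WINS = np.empty((10_000, 16, 2000), dtype='float32')
for i in range(10_000):
    ir = random.randint(0, A.shape[0] - 16)
    WINS[i] = A[ir:ir+16, :]   # assignment copies the window out of the memmap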

answered 2021-07-02 at 08:55