python - Numpy: load a memory-mapped array (mmap_mode) from Google Cloud Storage

Question

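I want to load a .npy file from Google Cloud Storage (gs://project/file.npy) into my Google ML job as training data. Because the file is over 10 GB, I want to use the mmap_mode option of numpy.load() to avoid running out of memory.

Background: I use Keras with fit_generator and a Keras Sequence to load batches of data from a .npy file stored in Google Cloud Storage.

To access Google Cloud Storage I use BytesIO, because not every library can read from it directly. This code works fine without mmap_mode = 'r':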

import numpy as np
from tensorflow.python.lib.io import file_io
from io import BytesIO

filename = 'gs://project/file'

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file)

If I enable mmap_mode, I get the following error:

TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't understand why it no longer accepts a BytesIO object.

Code including mmap_mode:

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file, mmap_mode = 'r')

Traceback:

File "[...]/numpy/lib/npyio.py", line 444, in load
    return format.open_memmap(file, mode=mmap_mode)
File "[...]/numpy/lib/format.py", line 829, in open_memmap
    fp = open(os_fspath(filename), 'rb')
File "[...]/numpy/compat/py3k.py", line 237, in os_fspath
    "not " + path_type.__name__)
TypeError: expected str, bytes or os.PathLike object, not BytesIO

score 0 · Accepted Answer

You can convert the BytesIO object to bytes using b.getvalue():

x_file = BytesIO(file_io.read_file_to_string(filename + '.npy', binary_mode = True))
x = np.load(x_file.getvalue(), mmap_mode = 'r')
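One caveat: np.load treats a str or bytes argument as a file path rather than as file contents, and mmap_mode needs a real file on disk to map. If the call above fails for that reason, a possible workaround is to stream the object to local scratch space first and memory-map the local copy. The following is only a minimal sketch under that assumption. The helper name load_npy_memmapped and the scratch path are hypothetical, and a small in-memory array stands in for the remote file.

```python
import os
import shutil
import tempfile
from io import BytesIO

import numpy as np


def load_npy_memmapped(source, local_path):
    """Stream a .npy file-like object to local_path, then memory-map it.

    `source` is any binary file-like object with a read() method;
    `local_path` is a hypothetical scratch location on the local disk.
    """
    with open(local_path, "wb") as target:
        # Copy in 16 MiB chunks so the whole file never sits in RAM at once.
        shutil.copyfileobj(source, target, length=16 * 1024 * 1024)
    # mmap_mode requires a path to a real file, not an in-memory buffer.
    return np.load(local_path, mmap_mode="r")


# Stand-in for the remote file: a small array serialized into memory.
buffer = BytesIO()
np.save(buffer, np.arange(12, dtype=np.float32).reshape(3, 4))
buffer.seek(0)

scratch_dir = tempfile.mkdtemp()
x = load_npy_memmapped(buffer, os.path.join(scratch_dir, "file.npy"))
print(type(x).__name__, x.shape)
```

In the setting of the question, source would presumably be a handle such as file_io.FileIO(filename + '.npy', mode='rb'), or the copy could be done with file_io.copy. Neither was tested here, and the job needs enough local disk space to hold the file.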
