python - Why does using a database (redis, SQL) help when loading large data and running out of RAM?

Question

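I need to load 100,000 images from a directory and put them all in one big dictionary, where the key is the image id and the value is a numpy array of the image's pixels. Building this dictionary takes 19 GB of RAM, and I have 24 GB in total. I then need to sort the dictionary by key, take only the values of the ordered dictionary, and save them as one big numpy array. I need this big array because I want to pass it to sklearn's train_test_split function and split the whole dataset, together with its labels, into training and test sets. I found this question, where the asker has the same problem of running out of RAM at the step of sorting the dictionary after creating it: How to sort a LARGE dictionary. People there suggested using a database.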

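# Imports assumed from the names used below (Keras image utilities)
import os
import numpy as np
from numpy.lib.format import open_memmap
from keras.preprocessing import image
from keras.preprocessing.image import load_img

def save_all_images_as_one_numpy_array():
    data_dict = {}
    for img in os.listdir('images'):
        id_img = img.split('_')[1]
        loadimg = load_img(os.path.join('images', img))
        x = image.img_to_array(loadimg)
        data_dict[id_img] = x

    # Sort by numeric id and stack the values; np.stack copies every array
    data_dict = np.stack([v for k, v in sorted(data_dict.items(), key=lambda x: int(x[0]))])
    mmamfile = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='w+', shape=data_dict.shape)
    mmamfile[:] = data_dict[:]


def load_numpy_array_with_images():
    a = open_memmap('trythismmapfile.npy', dtype=np.float32, mode='r')
    return a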

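When I call np.stack, every numpy array is stacked into a new array, and that is where I run out of RAM. I cannot afford to buy more RAM. I thought I could use redis in a docker container, but I don't understand why or how a database would solve my problem.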

score 1 · Accepted Answer

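A database helps because database libraries store the data on the hard disk rather than keeping all of it in memory. If you look at the documentation for the library suggested in the linked answer, you will see that the first argument is a filename, which shows that the hard disk is used.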
https://docs.python.org/2/library/bsddb.html#bsddb.hashopen
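
To make the disk-backed idea concrete, here is a minimal sketch using the standard-library sqlite3 module instead of bsddb or redis (a swapped-in choice for illustration, not the library from the linked answer). The table name, column names, and helper functions are hypothetical, and the payloads are stand-in bytes; a numpy array could be serialized with its tobytes() method before being stored.

```python
import os
import sqlite3
import tempfile

def store_blobs(db_path, items):
    """Write (id, bytes) pairs to an on-disk SQLite table."""
    con = sqlite3.connect(db_path)
    with con:  # commits the transaction on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, pixels BLOB)"
        )
        con.executemany("INSERT OR REPLACE INTO images VALUES (?, ?)", items)
    con.close()

def iter_sorted(db_path):
    """Yield (id, bytes) rows in key order; the database does the sorting."""
    con = sqlite3.connect(db_path)
    try:
        for row in con.execute("SELECT id, pixels FROM images ORDER BY id"):
            yield row
    finally:
        con.close()

# Stand-in payloads (hypothetical data), inserted out of order
db_path = os.path.join(tempfile.mkdtemp(), "images.db")
store_blobs(db_path, ((i, bytes([i]) * 4) for i in (3, 1, 2)))
ordered_ids = [img_id for img_id, _ in iter_sorted(db_path)]
```

The rows live in the file at db_path, and iterating over the cursor hands back one row at a time, so the Python process never has to hold every payload at once.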

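However, the linked question is about sorting by value, not by key. Sorting by key takes much less memory, because you can sort the IDs before loading any image, although you may still run into memory problems when training the model. I suggest trying something like this: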

# Imports assumed from the names used in the question (Keras image utilities)
import os
import numpy as np
from keras.preprocessing import image
from keras.preprocessing.image import load_img

# Get the list of file names
imgs = os.listdir('images')

# Create a mapping of ID to file name
# This will allow us to sort the IDs then load the files in order
img_ids = {int(img.split('_')[1]): img for img in imgs}

# Get the list of file names sorted by ID
sorted_imgs = [v for k, v in sorted(img_ids.items(), key=lambda x: x[0])]

# Define a function for loading a named img
# (named load_img_array so it does not shadow, and recursively call, load_img)
def load_img_array(img):
    loadimg = load_img(os.path.join('images', img))
    return image.img_to_array(loadimg)

# Iterate through the sorted file names and stack the results
data_dict = np.stack([load_img_array(img) for img in sorted_imgs])
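
Note that np.stack over a list still needs the full list of arrays and the stacked copy in memory at the same time. One possible way around that, sketched below, is to write each image straight into a preallocated .npy memmap in sorted order. This is only a sketch under assumptions: all images are assumed to share one shape, stack_to_memmap and fake_loader are hypothetical names, and the stand-in loader replaces the real image-loading call so the example runs on its own.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

def stack_to_memmap(names, load_fn, out_path, dtype=np.float32):
    """Write load_fn(name) for each name into one .npy file, in the given order."""
    # Load the first image only to learn the per-image shape (assumed uniform)
    first = np.asarray(load_fn(names[0]), dtype=dtype)
    out = open_memmap(out_path, mode="w+", dtype=dtype,
                      shape=(len(names),) + first.shape)
    out[0] = first
    for i, name in enumerate(names[1:], start=1):
        out[i] = load_fn(name)  # only one loaded image is alive at a time
    out.flush()
    return out

# Hypothetical stand-in for the real loader: fills an array with the image id
def fake_loader(name):
    img_id = int(name.split("_")[1])
    return np.full((2, 2, 3), img_id, dtype=np.float32)

names = ["img_10_a.png", "img_2_b.png", "img_7_c.png"]
sorted_names = sorted(names, key=lambda n: int(n.split("_")[1]))
out_path = os.path.join(tempfile.mkdtemp(), "stacked.npy")
stacked = stack_to_memmap(sorted_names, fake_loader, out_path)
```

The resulting file can be reopened later with open_memmap(out_path, mode="r"). A later step such as train_test_split may still copy the selected rows into memory, so this addresses the stacking step only.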
