python - Tips for processing a large number of images in Python

Question

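I have been trying to process two large files containing about 40,000-50,000 images in Python. However, whenever I try to convert my dataset into a numpy array, I get a memory error. I only have about 8 GB of RAM, which is not a lot, but since I lack Python experience, I wonder whether there is any way to solve this by using some Python library I don't know about, or perhaps by optimizing my code. I would like to hear your opinion on this.

My image processing code: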

from sklearn.cluster import MiniBatchKMeans
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA

image_dir1 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/train"
image_dir2 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/test1"
Standard_size = (300,200)
pca = PCA(n_components = 10)
file_open = lambda x,y: glob.glob(os.path.join(x,y))


def matrix_image(image):
    "opens image and converts it to a m*n matrix" 
    image = Image.open(image)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())
    image = map(list,image)
    image = np.array(image)
    return image
def flatten_image(image):  
    """
    takes in a n*m numpy array and flattens it to 
    an array of the size (1,m*n)
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1,s)
    return image_wide[0]

if __name__ == "__main__":
    train_images = file_open(image_dir1,"*.jpg")
    test_images = file_open(image_dir2,"*.jpg")
    train_set = []
    test_set = []

    "Loop over all images in files and modify them"
    train_set = [flatten_image(matrix_image(image))for image in train_images]
    test_set = [flatten_image(matrix_image(image))for image in test_images]
    train_set = np.array(train_set) #This is where the Memory Error occurs
    test_set = np.array(test_set)

Minor edit: I am using 64-bit Python.

score 7 · Accepted Answer

Assuming a 4-byte integer for each pixel, you are trying to hold about 11.2 GB of data in memory (4*300*200*50000 / (1024)**3). Half that for 2-byte integers.
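To make the arithmetic above easy to rerun for other element sizes, here is a minimal sketch. The helper name `estimate_gib` and its parameters are made up for illustration; only the image count and dimensions come from the question.

```python
import numpy as np

def estimate_gib(n_images, width, height, channels=1, dtype=np.int32):
    """Estimate the size in GiB of an array holding every pixel of every image."""
    bytes_total = n_images * width * height * channels * np.dtype(dtype).itemsize
    return bytes_total / 1024**3

# Same 50,000 images of 300x200 pixels, stored with different element sizes.
for dt in (np.int32, np.int16, np.uint8):
    print(np.dtype(dt).name, round(estimate_gib(50000, 300, 200, dtype=dt), 1))
# int32 11.2
# int16 5.6
# uint8 2.8
```

Pixel values from 8-bit images fit in `uint8`, so the element type alone can account for a 4x difference.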

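You have a few options:

Reduce the number or size of the images you are trying to hold in memory
Use a file or database to hold the data instead of memory (may be too slow for some applications)
Use the memory you have more efficiently...

Instead of copying from a list to numpy, which temporarily uses twice the amount of memory, as you are doing here: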

test_set = [flatten_image(matrix_image(image))for image in test_images]
test_set = np.array(test_set)

do this:

n = len(test_images)
# Preallocate the full array once, then fill it row by row.
test_set = np.zeros((n, 300*200), dtype=int)
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))
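As a rough sketch of how the options above could be combined, the block below preallocates with an 8-bit element type and can optionally back the array with a file through `numpy.memmap`. The helpers `allocate_rows` and `fake_image_vector` and the file name `images.dat` are hypothetical; `fake_image_vector` only stands in for `flatten_image(matrix_image(...))` so the example runs without any image files.

```python
import os
import tempfile
import numpy as np

def allocate_rows(n_rows, row_len, path=None, dtype=np.uint8):
    """Return a zero-filled (n_rows, row_len) array, in RAM or backed by a file."""
    if path is None:
        return np.zeros((n_rows, row_len), dtype=dtype)
    # mode="w+" creates (or overwrites) the file and maps it into memory.
    return np.memmap(path, dtype=dtype, mode="w+", shape=(n_rows, row_len))

def fake_image_vector(i, row_len):
    """Hypothetical stand-in for flatten_image(matrix_image(path))."""
    return np.full(row_len, i % 256, dtype=np.uint8)

n, row_len = 8, 300 * 200

# Option 1: everything in RAM, one byte per pixel.
in_ram = allocate_rows(n, row_len)
for i in range(n):
    in_ram[i] = fake_image_vector(i, row_len)

# Option 2: same layout, but the data lives in a file on disk.
tmp_dir = tempfile.mkdtemp()
on_disk = allocate_rows(n, row_len, path=os.path.join(tmp_dir, "images.dat"))
for i in range(n):
    on_disk[i] = fake_image_vector(i, row_len)
on_disk.flush()  # write pending changes to the file

print(in_ram.nbytes, on_disk.shape)
```

The file-backed version trades speed for capacity: the operating system pages rows in and out as needed, which may be too slow for some workloads.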

score 5 · Accepted Answer

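Since your files are JPEGs and you have 300x200 images, 24-bit color comes to about 1.4 million bits per decoded image and about 43 billion bits for 30,000 images. The session below passes those bit counts to humanize, which labels them as bytes, so its output overstates the size by a factor of 8; in bytes this is roughly 176 KiB per image and 5 GiB overall: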

In [4]: import humanize # `pip install humanize` if you need it

In [5]: humanize.naturalsize(300*200*24, binary=True)
Out[5]: '1.4 MiB'

In [6]: humanize.naturalsize(300*200*24*30000, binary=True)
Out[6]: '40.2 GiB'

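If you have grayscale, 8-bit images come to about 59 KiB each and about 1.7 GiB overall (the output below again counts bits, not bytes):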

In [7]: humanize.naturalsize(300*200*8, binary=True)
Out[7]: '468.8 KiB'

In [8]: humanize.naturalsize(300*200*8*30000, binary=True)
Out[8]: '13.4 GiB'

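This is also for just one copy. Depending on the operations, this can get much bigger.

Go Bigger

You can always rent some time on a server with more memory.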
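To illustrate how a single operation can multiply the footprint, here is a small sketch. It assumes the processing step converts the pixels to 64-bit floats, as many numerical routines do; the batch of 100 blank images is made up for the demonstration.

```python
import numpy as np

# 100 flattened 300x200 grayscale images, one byte per pixel.
batch = np.zeros((100, 300 * 200), dtype=np.uint8)

# Converting to float64 makes a new array with 8 bytes per pixel.
as_float = batch.astype(np.float64)

# Centering allocates yet another full-size float64 array.
centered = as_float - as_float.mean(axis=0)

print(batch.nbytes, as_float.nbytes, centered.nbytes)
# 6000000 48000000 48000000
```

All three arrays are alive at the same time here, so the peak usage is 17 times the size of the original 8-bit batch.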

AWS - up to 224 GB
Rackspace - up to 120 GB
DigitalOcean - up to 96 GB
Azure - up to 56 GB

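The amount of RAM is not the only way to judge which servers are best for your workload. There are other differences between providers, including IOPS, number of cores, CPU type, and so on.

Test After Training

Once you have trained your model, you do not need the full training dataset. Delete what you can from memory. In Python land, this means not keeping a reference to the data. Strange beast, yes.

This probably means setting up your training data and creating your model inside a function that returns only what you need.

Reduce Memory Footprint

Let's imagine that you could store it all in memory. One improvement you could make here is to convert directly from a PIL Image to a numpy array. Existing arrays are not copied; it is a view into the original data. However, it looks like you also need to flatten it into a vector space.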

image = Image.open(image)
print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
image = image.resize(Standard_size)
np_image = np.asarray(image).flatten()

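Edit: Actually, this helps the maintainability of your code but does not help performance. You can do this individually for each image in a function. The garbage collector will throw away the old stuff. Move along, nothing to see here.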
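For reference, the per-image function could look something like the sketch below. The name `image_to_vector` and the grayscale option are assumptions added for illustration, and the demo uses an in-memory image from `Image.new` instead of a file from the question's folders. It requires Pillow and numpy.

```python
import numpy as np
from PIL import Image

STANDARD_SIZE = (300, 200)  # (width, height), as in the question

def image_to_vector(img, size=STANDARD_SIZE, grayscale=True):
    """Resize a PIL image and return its pixels as a flat uint8 vector."""
    # "L" is one 8-bit channel per pixel, "RGB" is three.
    img = img.convert("L" if grayscale else "RGB")
    img = img.resize(size)
    return np.asarray(img, dtype=np.uint8).ravel()

# Synthetic image so the example runs without any files on disk.
demo = Image.new("RGB", (640, 480), color=(10, 20, 30))

gray_vec = image_to_vector(demo)
rgb_vec = image_to_vector(demo, grayscale=False)
print(gray_vec.shape, gray_vec.dtype, gray_vec.nbytes)
print(rgb_vec.shape, rgb_vec.dtype, rgb_vec.nbytes)
```

Each returned vector can be written straight into one row of a preallocated `uint8` array, so only one decoded image exists outside that array at any time.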
