I have to train an image classifier on an image dataset. There are about 1,000,000 images. Each image is about 100 KB, so in total it is roughly 100 GB of data.
I have to feed the whole dataset to the trainer about 100 times (100 epochs). Each epoch should be fed in mini-batches (about 1000 images each) for stochastic gradient descent. To reduce overfitting, the mini-batches should be the pieces of a random split of the dataset, and the split should be redone at the start of every epoch.
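To make the requirement concrete, here is a minimal sketch of the loop I have in mind (the file layout and the `load_images` / `train_on_batch` calls are placeholders, not an actual API):

```python
import random

# Hypothetical layout: one file per image.
image_paths = [f"images/{i:07d}.jpg" for i in range(1_000_000)]

BATCH_SIZE = 1000
NUM_EPOCHS = 100

for epoch in range(NUM_EPOCHS):
    # Re-shuffle the whole dataset at the start of every epoch.
    random.shuffle(image_paths)
    # Walk over it sequentially in mini-batches of 1000 images.
    for start in range(0, len(image_paths), BATCH_SIZE):
        batch_paths = image_paths[start:start + BATCH_SIZE]
        # Placeholders for my loader and trainer:
        # batch = load_images(batch_paths)
        # train_on_batch(batch)
        pass
```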
I have 16 GB of RAM, which is too little to hold the whole dataset, so I have to keep it on disk.
Also, I know that random-access disk reads are really slow, even with something like LevelDB. So it seems I have to re-write the data to disk in the shuffled order before each epoch, and then read it back sequentially.
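This is roughly what I mean by re-dumping, assuming one file per image and a flat per-epoch file of length-prefixed records that the trainer can then read sequentially (names and format are just for illustration):

```python
import random

# Hypothetical source layout: one file per image.
image_paths = [f"images/{i:07d}.jpg" for i in range(1_000_000)]

def dump_shuffled_epoch(paths, out_path):
    """Write all images into one file, in a freshly shuffled order."""
    order = list(paths)
    random.shuffle(order)
    with open(out_path, "wb") as out:
        for p in order:
            with open(p, "rb") as f:
                data = f.read()
            # Length-prefixed record so it can be read back sequentially.
            out.write(len(data).to_bytes(4, "little"))
            out.write(data)

# One such pass per epoch, e.g.:
# dump_shuffled_epoch(image_paths, "epoch_000.bin")
```

But doing this 100 times means re-writing 100 GB per epoch, which is what I am not sure about.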
What is the best way to do this?