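If you are using TensorFlow 2, I would suggest trying one of two approaches:
- Use .flow_from_directory(): as the documentation says, you can pass it the path of the directory where your images are saved, and the iterator you get back from your datagen object can then be passed to model.fit(). Here is the example from the TensorFlow documentation linked above (with some extra comments added for clarity):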
# Assumes `model` is an already-compiled Keras model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Set the augmentations the data generators will apply
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

# Instantiate a DirectoryIterator - this yields the batches of data samples + their labels
train_generator = train_datagen.flow_from_directory(
    'data/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
    'data/validation',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')

# Train a Sequential model
model.fit(
    train_generator,
    steps_per_epoch=2000,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=800)
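- Use tf.data.Dataset.from_generator: if you want to take advantage of the tf.data API and your dataset has not yet been split into training and test sets, this approach may be more convenient. Here is an example of how it works (from a different page of the documentation):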
import tensorflow as tf

# This example uses an image dataset that has NOT been split into train/test yet
flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

# Like before, set the data augmentations
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

# (Optional) Double-check the dimensions of a single batch
images, labels = next(img_gen.flow_from_directory(flowers))
print(images.dtype, images.shape)  # float32 (32, 256, 256, 3)
print(labels.dtype, labels.shape)  # float32 (32, 5)

# Now, you can make a dataset with the augmentations
ds = tf.data.Dataset.from_generator(
    lambda: img_gen.flow_from_directory(flowers),
    output_types=(tf.float32, tf.float32),
    output_shapes=([32, 256, 256, 3], [32, 5])
)
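Of course, you may still be wondering, "how do we split this ds variable into training and test sets?"
Fortunately, Angel Igareta has written a great blog post on this topic. Below I include only the code snippet that solves our problem: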
import math

def get_dataset_partitions_tf(ds, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    """Credit to Angel Igareta at https://towardsdatascience.com/how-to-split-a-tensorflow-dataset-into-train-validation-and-test-sets-526c8dd29438 for this code."""
    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1
    assert math.isclose(train_split + test_split + val_split, 1.0)

    if shuffle:
        # Specify seed to always have the same split distribution between runs;
        # reshuffle_each_iteration=False keeps the order fixed across epochs,
        # so the three partitions do not leak into each other
        ds = ds.shuffle(shuffle_size, seed=12, reshuffle_each_iteration=False)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)

    return train_ds, val_ds, test_ds
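One detail worth checking before calling this function: each element of a dataset built from a Keras generator is a whole batch, so ds_size would need to be a count of batches rather than of images. The plain-Python sketch below mirrors only the function's size arithmetic to show what each partition would receive. The helper name partition_sizes and the figure of 115 batches are assumptions made for illustration; they do not come from the blog post or the documentation.

```python
def partition_sizes(ds_size, train_split=0.8, val_split=0.1):
    """Mirror the take/skip arithmetic: the test partition gets whatever remains."""
    train_size = int(train_split * ds_size)  # int() truncates toward zero
    val_size = int(val_split * ds_size)
    test_size = ds_size - train_size - val_size
    return train_size, val_size, test_size


# Assumed for illustration: the dataset holds 115 batches in total
ds_size = 115
train_size, val_size, test_size = partition_sizes(ds_size)
print(train_size, val_size, test_size)
```

Because the sizes are truncated, the test partition absorbs any rounding remainder. Also note that a Keras generator typically loops indefinitely, so it may be necessary to bound the dataset (for example with ds.take(ds_size)) before splitting; otherwise the final skip-only partition would never end.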
This way, you will be able to pass your dataset to model.fit(), and TensorFlow will essentially perform the data augmentation for you during training.
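When the input to model.fit() is a generator or a generator-backed dataset, steps_per_epoch and validation_steps tell Keras how many batches make up one pass. As a rough sketch, they can be derived from the sample count and batch size as shown below. The helper name steps_for and the sample counts are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def steps_for(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (the last batch may be partial)."""
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size must be > 0")
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset sizes, for illustration only
train_samples = 2000
val_samples = 800
batch_size = 32

steps_per_epoch = steps_for(train_samples, batch_size)
validation_steps = steps_for(val_samples, batch_size)
print(steps_per_epoch, validation_steps)
```

The resulting values would then be passed as model.fit(..., steps_per_epoch=steps_per_epoch, validation_steps=validation_steps).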
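Last but not least: in your case, I believe you will want to pass featurewise_std_normalization=True to the ImageDataGenerator constructor. Let me know if I missed something in your question, but I don't think there is actually a parameter named feature_std_normalization.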
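For intuition, feature-wise standardization amounts to subtracting a dataset-wide mean and dividing by a dataset-wide standard deviation. The plain-Python sketch below shows that arithmetic on a handful of made-up pixel values for a single channel. It is a conceptual illustration under that simplification, not Keras's implementation, and the function name and values are hypothetical.

```python
import statistics


def featurewise_standardize(values, eps=1e-6):
    """Center values on the dataset mean and scale by the dataset std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # eps guards against division by zero when all values are identical
    return [(v - mean) / (std + eps) for v in values]


# Made-up pixel intensities for one channel, already rescaled to [0, 1]
pixels = [0.0, 0.25, 0.5, 0.75, 1.0]
normalized = featurewise_standardize(pixels)
print(normalized)
```

With ImageDataGenerator, the dataset-wide statistics are only available after calling the generator's fit() method on a sample of the training images, so that step is needed before the normalization has any effect.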