System information:
- Laptop
- OS platform and distribution: Ubuntu Linux 18.04, x64
- TensorFlow installed from: pip
- TensorFlow version: 2.1.0
- Python version: 3.6.9
- GPU model and memory: NVIDIA RTX 2060, 6 GB
- CPU model: i7-9850H
- RAM: 16 GB
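I have been using TensorFlow 2.0 on CPU on another PC.
On this laptop I installed CUDA 10.1, following the guide at https://www.tensorflow.org/install/gpu.
I then ran an old script that trains a NN based on ResNet50V2 on a dataset of 26998 training images and 1000 validation images, with 2 classes.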
The network:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
keras_layer (KerasLayer) (None, 1792) 4363712
_________________________________________________________________
dense (Dense) (None, 64) 114752
_________________________________________________________________
dropout (Dropout) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 130
=================================================================
Total params: 4,478,594
Trainable params: 114,882
Non-trainable params: 4,363,712
_________________________________________________________________
where keras_layer is the ResNet obtained from tensorflow_hub.
The first problem was a CUDA_ERROR_OUT_OF_MEMORY,
which I solved by adding:
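import sys
import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of reserving it all up front.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for dev in physical_devices:
    try:
        tf.config.experimental.set_memory_growth(dev, True)
        print(dev, "SET MEMORY GROWTH")
    except (RuntimeError, ValueError):
        # Raised if the runtime is already initialized or the device is invalid.
        print("Device config error")
        sys.exit(1)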
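The TensorFlow GPU guide also describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that requests the same on-demand allocation. A minimal sketch, untested on this setup, assuming the variable is set before TensorFlow is imported:

```python
import os

# Assumption: this runs before `import tensorflow`, so the setting is
# already in place when TensorFlow initializes the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

This is only an alternative way to request memory growth; it is not expected to behave differently from the set_memory_growth loop.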
But now I get similar warnings:
2020-04-07 01:39:57.857284: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 2.70G (2897281024 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-04-07 01:39:58.035192: W tensorflow/core/common_runtime/bfc_allocator.cc:309] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
Both are printed several times.
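To put the failed allocation in perspective, here is a small arithmetic sketch. It assumes the card's nominal 6 GB can be treated as 6 GiB, which is only an approximation:

```python
failed_bytes = 2897281024   # value copied from the "failed to allocate" log line
gib = failed_bytes / 2**30  # bytes -> GiB
card_gib = 6                # assumption: nominal 6 GB treated as 6 GiB

share = 100 * gib / card_gib
print(f"{gib:.2f} GiB, about {share:.0f}% of the card")
```

The result matches the "2.70G" in the log, so the log figure is in GiB, and a single request of that size is close to half of the card's memory.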
After these warnings I get:
2020-04-07 01:41:59.069302: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
I have read that the two are unrelated, but it is not clear to me what could cause this second warning.
Finally, this appears:
WARNING:tensorflow:sample_weight modes were coerced from
...
to
['...']
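(I think these are caused by three different problems. I decided to post them all in one question to avoid spamming, but if that is a problem I can split them into separate threads.)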
I used ImageDataGenerator to generate the datasets:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# batch_size, train_dir, validation_dir, IMG_H and IMG_W are defined earlier in the script.
train_image_generator = ImageDataGenerator(rescale=1./255., rotation_range=10., horizontal_flip=True)  # Generator for our training data
validation_image_generator = ImageDataGenerator(rescale=1./255.)  # Generator for our validation data
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_H, IMG_W),
                                                           class_mode='sparse')
validation_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                                     directory=validation_dir,
                                                                     shuffle=True,
                                                                     target_size=(IMG_H, IMG_W),
                                                                     class_mode='sparse')
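Since batch_size is not shown, here is a rough sketch of how many batches each generator yields per epoch, assuming a hypothetical batch_size of 32. Keras directory iterators report ceil(samples / batch_size) as their length, and a smaller batch size is also what the allocator warning suggests trying.

```python
import math

n_train, n_val = 26998, 1000  # image counts from the post
batch_size = 32               # hypothetical; the real value is not shown

# The last batch of each epoch is smaller when the count is not a multiple.
train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)
print(train_steps, val_steps)
```

Halving batch_size roughly doubles these step counts while reducing the memory needed for each batch.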
If other code is needed, I will add it.
Thanks.
Edit 1:
Regarding the warning:
2020-04-07 01:41:59.069302: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
I tried setting workers=1 in fit() and the warning disappeared, but I still do not know the cause or the consequences of this warning.
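For completeness, a sketch of what that call might look like in TensorFlow 2.1. The helper name fit_single_worker and its parameters are illustrative and not part of the original script:

```python
def fit_single_worker(model, train_data, val_data, epochs):
    """Call model.fit with the generator read by a single worker thread."""
    return model.fit(
        train_data,
        validation_data=val_data,
        epochs=epochs,
        workers=1,                  # the setting that made the warning disappear
        use_multiprocessing=False,  # thread-based loading, the Keras default
    )
```

With the objects from the post this would be called with the compiled model, train_data_gen and validation_data_gen, plus whatever epoch count the script uses.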