
System information:

  • Laptop
  • OS platform and distribution: Ubuntu Linux 18.04, x64
  • TensorFlow installed from: pip
  • TensorFlow version: 2.1.0
  • Python version: 3.6.9
  • GPU model and memory: Nvidia RTX 2060, 6 GB
  • CPU model: i7-9850H
  • RAM: 16 GB

I had previously been using TensorFlow 2.0 on the CPU on another PC.

I installed CUDA 10.1 (following the guide at https://www.tensorflow.org/install/gpu).

I started running an old script that trains a NN based on ResNet50V2 on a dataset of 26,998 training images and 1,000 validation images, with 2 classes.

The network:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
keras_layer (KerasLayer)     (None, 1792)              4363712   
_________________________________________________________________
dense (Dense)                (None, 64)                114752    
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
=================================================================
Total params: 4,478,594
Trainable params: 114,882
Non-trainable params: 4,363,712
_________________________________________________________________

where keras_layer is the ResNet obtained from tensorflow_hub.
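
For context, this is roughly how such a model can be assembled with tensorflow_hub; the hub handle, input size, and dropout rate below are only illustrative (the module in my script outputs 1792 features, so it is not exactly this one):

import tensorflow as tf
import tensorflow_hub as hub

# Illustrative feature-vector handle; not necessarily the module used above.
FEATURE_EXTRACTOR = "https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/4"

model = tf.keras.Sequential([
    hub.KerasLayer(FEATURE_EXTRACTOR, trainable=False,
                   input_shape=(224, 224, 3)),     # keras_layer (frozen)
    tf.keras.layers.Dense(64, activation='relu'),  # dense
    tf.keras.layers.Dropout(0.2),                  # dropout
    tf.keras.layers.Dense(2),                      # dense_1 (2 classes)
])
model.summary()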

As a first problem I got a CUDA_ERROR_OUT_OF_MEMORY, which I solved by adding:

import sys
import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of grabbing (almost) all of it up front.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for dev in physical_devices:
  try:
    tf.config.experimental.set_memory_growth(dev, True)
    print(dev, "SET MEMORY GROWTH")
  except (ValueError, RuntimeError):
    print("Device config error")
    sys.exit(1)
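
For reference, an alternative way to handle this (a sketch, with an illustrative memory_limit of 4096 MB) is to cap the memory TensorFlow may take by creating a logical device instead of growing allocations on demand:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Restrict TensorFlow to a fixed slice of the GPU (value in MB, illustrative).
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
  except RuntimeError as e:
    # Virtual devices must be configured before the GPU is initialized.
    print(e)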

But now I am getting warnings like these:

2020-04-07 01:39:57.857284: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 2.70G (2897281024 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

2020-04-07 01:39:58.035192: W tensorflow/core/common_runtime/bfc_allocator.cc:309] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.

Both are printed several times.
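
The second message itself suggests two knobs: a smaller batch size, or disabling GPU garbage collection. A sketch of the latter (the environment variable has to be set before TensorFlow initializes the GPU, so ideally before the import):

import os

# Must be set before TensorFlow touches the GPU.
os.environ["TF_ENABLE_GPU_GARBAGE_COLLECTION"] = "false"

import tensorflow as tf  # noqa: E402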

After this I get:

2020-04-07 01:41:59.069302: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

I read that the two are not related, but it is not clear to me what might be causing this second warning.

Finally, this shows up:

WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']

(I think these come from three different problems; I decided to post them all in a single question to avoid spamming, but if that is an issue I can split them into separate threads.)

I used ImageDataGenerator to generate the datasets:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_image_generator = ImageDataGenerator(rescale=1./255., rotation_range=10., horizontal_flip=True)  # Generator for our training data
validation_image_generator = ImageDataGenerator(rescale=1./255.)  # Generator for our validation data

train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_H, IMG_W),
                                                           class_mode='sparse')

validation_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                                     directory=validation_dir,
                                                                     shuffle=True,
                                                                     target_size=(IMG_H, IMG_W),
                                                                     class_mode='sparse')
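
For context, this is roughly how these generators would be fed to fit() (a sketch; model, batch_size and the epoch count are placeholders, not taken verbatim from my script):

history = model.fit(
    train_data_gen,
    steps_per_epoch=train_data_gen.samples // batch_size,
    epochs=10,  # illustrative
    validation_data=validation_data_gen,
    validation_steps=validation_data_gen.samples // batch_size)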

If any other code is needed, I will add it.

Thanks.

EDIT 1:

Regarding the warning:

2020-04-07 01:41:59.069302: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

I tried setting workers=1 in fit() and it disappears, but I still do not know the cause and the consequences of this warning.
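
Concretely, the change is just the workers argument (a sketch; the other values are illustrative):

history = model.fit(
    train_data_gen,
    validation_data=validation_data_gen,
    epochs=10,                   # illustrative
    workers=1,                   # single worker thread for the generator
    use_multiprocessing=False)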


2 Answers


I think this is related to TensorFlow version 2.1.

Try upgrading to 2.2 or later; it may solve the problem.

answered 2020-09-07T05:27:07.183

This error is due to the GPU still being occupied by a previous run of the program. When you try to re-run, there is no memory left to load the model again.

Do the following -

  1. Open a terminal and type nvidia-smi
  2. Find the process ID (PID) occupying your GPU
  3. Kill the process occupying the GPU using kill -9 PID

Note - You can also kill the process using top

answered 2020-04-07T23:38:56.183