python - What should I do when I get an mAP of 0.000 with keras-retinanet / resnet50?

Question

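I am currently using fizyr/keras-retinanet to train a model that detects 3 classes. When I train the model, I get an average precision of 0.0000 for every class. In a few epochs I got a slightly higher precision, e.g. 0.0007.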

I have looked at these threads, but their solutions do not seem to work: https://github.com/fizyr/keras-retinanet/issues/647 and https://github.com/fizyr/keras-retinanet/issues/1351

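Specifically, I added the --image-max-side argument to the training command and set it to 2560 pixels. The images I am using are 1920x2560 pixels. The training set has 916 images and the validation set has 258 images.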
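To make the effect of --image-max-side concrete, here is a minimal sketch of the resize arithmetic as keras-retinanet is generally understood to apply it: the shorter side is scaled to a minimum side, and the scale is only reduced if the longer side would then exceed the maximum side. The function below is a standalone re-implementation for illustration, not the library's code, and the defaults of 800 and 1333 are assumptions that should be checked against the installed version.

```python
def compute_resize_scale(image_shape, min_side=800, max_side=1333):
    """Return the scale factor for an image of shape (rows, cols).

    Illustrative re-implementation; min_side/max_side defaults are assumed.
    """
    rows, cols = image_shape[:2]
    smallest_side = min(rows, cols)
    largest_side = max(rows, cols)
    # First try to bring the shorter side to min_side.
    scale = min_side / smallest_side
    # Only if the longer side would then exceed max_side, cap by max_side.
    if largest_side * scale > max_side:
        scale = max_side / largest_side
    return scale


image_shape = (2560, 1920)  # the 1920x2560 images from the question

default_scale = compute_resize_scale(image_shape)
with_flag = compute_resize_scale(image_shape, max_side=2560)
both_flags = compute_resize_scale(image_shape, min_side=1920, max_side=2560)

for label, scale in [("defaults", default_scale),
                     ("max_side=2560", with_flag),
                     ("min_side=1920, max_side=2560", both_flags)]:
    resized = (round(image_shape[0] * scale), round(image_shape[1] * scale))
    print(f"{label}: scale={scale:.4f}, resized to {resized}")
```

Under these assumptions, raising only the maximum side leaves the scale unchanged for a 1920x2560 image, because the minimum side is the limiting factor; the images would still be shrunk to roughly 800x1067 unless the minimum side is raised as well.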

The full command I use to train the model is:

python train.py \
    --weights old_snapshots/resnet50_coco_best_v2.h5 \
    --backbone resnet50 \
    --batch-size 1 \
    --image-max-side 2560 \
    --epochs 50 \
    --steps 200 \
    --lr 1e-8 \
    --snapshot-path new_snapshots \
    --tensorboard-dir logs \
    --random-transform \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv
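As a quick sanity check on the command above, the following sketch works out how much of the 916-image training set each epoch covers with --batch-size 1 and --steps 200. It assumes each step consumes exactly one batch; the variable names are purely illustrative.

```python
train_images = 916     # training set size stated in the question
batch_size = 1         # --batch-size
steps_per_epoch = 200  # --steps
epochs = 50            # --epochs

images_per_epoch = batch_size * steps_per_epoch
fraction_per_epoch = images_per_epoch / train_images
total_images_seen = images_per_epoch * epochs
equivalent_passes = total_images_seen / train_images
# Ceiling division: steps needed for one epoch to cover every image once.
steps_for_full_pass = -(-train_images // batch_size)

print(f"images per epoch:        {images_per_epoch}")
print(f"fraction of set / epoch: {fraction_per_epoch:.1%}")
print(f"passes over the set:     {equivalent_passes:.2f}")
print(f"steps for a full pass:   {steps_for_full_pass}")
```

With these numbers each epoch sees about a fifth of the training images, and 50 epochs amount to roughly 11 passes over the whole set.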

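I have also tried running the above command without initializing the weights from COCO, which produces the same result. I copied the train.py file into my parent directory.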

I had to add this extra code to train.py so that training would not stop when the GPU ran out of resources:

import tensorflow as tf
# Allocate GPU memory on demand instead of reserving it all up front.
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

Here is a sample from my train.csv file:

dataset/202009/2020-09-18_20-26-16-480016.jpg,645,1178,819,1366,object1
dataset/202009/2020-09-18_20-26-16-480016.jpg,669,1306,1015,1486,object2
dataset/202009/2020-09-14_07-13-59-258711.jpg,,,,,
dataset/202009/2020-09-14_18-58-25-411295.jpg,,,,,
dataset/202009/2020-09-21_20-43-20-525886.jpg,1154,1214,1501,1429,object2
dataset/202009/2020-09-21_20-43-20-525886.jpg,1509,1176,1707,1396,object1
dataset/202009/2020-09-14_19-32-17-116910.jpg,,,,,

Here is my classes.csv file:

object1,0
object2,1
object3,2
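Since a malformed annotation file is one possible cause of zero precision, a small consistency check over the two CSV files can rule it out. The sketch below assumes the keras-retinanet CSV layout of path,x1,y1,x2,y2,class_name with empty fields for images that contain no objects; the helper names load_classes and check_annotations are hypothetical, and the inline sample rows are copied from the excerpts above.

```python
import csv
import io

ANNOTATIONS = """\
dataset/202009/2020-09-18_20-26-16-480016.jpg,645,1178,819,1366,object1
dataset/202009/2020-09-18_20-26-16-480016.jpg,669,1306,1015,1486,object2
dataset/202009/2020-09-14_07-13-59-258711.jpg,,,,,
dataset/202009/2020-09-21_20-43-20-525886.jpg,1154,1214,1501,1429,object2
"""

CLASSES = """\
object1,0
object2,1
object3,2
"""


def load_classes(text):
    """Map class name -> integer id from classes.csv content."""
    return {name: int(idx) for name, idx in csv.reader(io.StringIO(text))}


def check_annotations(text, classes):
    """Return (problems, per-class box counts, number of negative rows)."""
    problems = []
    counts = {name: 0 for name in classes}
    negatives = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != 6:
            problems.append(f"line {line_no}: expected 6 fields, got {len(row)}")
            continue
        path, x1, y1, x2, y2, name = row
        # A row with only a path marks an image without any objects.
        if (x1, y1, x2, y2, name) == ("", "", "", "", ""):
            negatives += 1
            continue
        try:
            x1, y1, x2, y2 = (int(v) for v in (x1, y1, x2, y2))
        except ValueError:
            problems.append(f"line {line_no}: non-integer coordinate")
            continue
        if x2 <= x1 or y2 <= y1:
            problems.append(f"line {line_no}: box has no area")
        elif name not in classes:
            problems.append(f"line {line_no}: unknown class {name!r}")
        else:
            counts[name] += 1
    return problems, counts, negatives


classes = load_classes(CLASSES)
# Class ids are expected to run 0..N-1 without gaps.
ids_contiguous = sorted(classes.values()) == list(range(len(classes)))
problems, counts, negatives = check_annotations(ANNOTATIONS, classes)

print("ids contiguous:", ids_contiguous)
print("problems:", problems)
print("boxes per class:", counts)
print("negative images:", negatives)
```

To run it on real data, the two inline strings would be replaced by the contents of train.csv, validation.csv, and classes.csv read from disk.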

My setup is:
Windows 10
TensorFlow 2.3.1
CUDA Toolkit 11.0
cuDNN v7.6.3

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48       Driver Version: 451.48       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro T2000       WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P8     5W /  N/A |    370MiB /  4096MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

The precision does not change across epochs. Here is a sample of the output:

2020-10-06 13:15:11.249841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:13.280542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-10-06 13:15:13.326726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
...
2020-10-06 13:15:13.419482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Creating model, this may take a second...
2020-10-06 13:15:14.118835: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-06 13:15:14.142741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21262d9de70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.151023: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-06 13:15:14.157396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-10-06 13:15:14.169198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
...
2020-10-06 13:15:14.208255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:14.214516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-06 13:15:14.783905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-06 13:15:14.791223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-10-06 13:15:14.797282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-10-06 13:15:14.801179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2905 MB memory) -> physical GPU (device: 0, name:
Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-10-06 13:15:14.818767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2120ce3da40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.825738: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro T2000, Compute Capability 7.5
Model: "retinanet"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, None, None,  0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, None, None, 6 9408        input_1[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, None, None, 6 256         conv1[0][0]
__________________________________________________________________________________________________
conv1_relu (Activation)         (None, None, None, 6 0           bn_conv1[0][0]
__________________________________________________________________________________________________
pool1 (MaxPooling2D)            (None, None, None, 6 0           conv1_relu[0][0]
__________________________________________________________________________________________________
res2a_branch2a (Conv2D)         (None, None, None, 6 4096        pool1[0][0]
__________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizati (None, None, None, 6 256         res2a_branch2a[0][0]
__________________________________________________________________________________________________
res2a_branch2a_relu (Activation (None, None, None, 6 0           bn2a_branch2a[0][0]
__________________________________________________________________________________________________
...

P4_merged (Add)                 (None, None, None, 2 0           P5_upsampled[0][0]
                                                                 C4_reduced[0][0]
__________________________________________________________________________________________________
P4_upsampled (UpsampleLike)     (None, None, None, 2 0           P4_merged[0][0]
                                                                 res3d_relu[0][0]
__________________________________________________________________________________________________
C3_reduced (Conv2D)             (None, None, None, 2 131328      res3d_relu[0][0]
__________________________________________________________________________________________________
P6 (Conv2D)                     (None, None, None, 2 4718848     res5c_relu[0][0]
__________________________________________________________________________________________________
P3_merged (Add)                 (None, None, None, 2 0           P4_upsampled[0][0]
                                                                 C3_reduced[0][0]
__________________________________________________________________________________________________
C6_relu (Activation)            (None, None, None, 2 0           P6[0][0]
__________________________________________________________________________________________________
P3 (Conv2D)                     (None, None, None, 2 590080      P3_merged[0][0]
__________________________________________________________________________________________________
P4 (Conv2D)                     (None, None, None, 2 590080      P4_merged[0][0]
__________________________________________________________________________________________________
P5 (Conv2D)                     (None, None, None, 2 590080      C5_reduced[0][0]
__________________________________________________________________________________________________
P7 (Conv2D)                     (None, None, None, 2 590080      C6_relu[0][0]
__________________________________________________________________________________________________
regression_submodel (Functional (None, None, 4)      2443300     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
classification_submodel (Functi (None, None, 3)      2422555     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
regression (Concatenate)        (None, None, 4)      0           regression_submodel[0][0]
                                                                 regression_submodel[1][0]
                                                                 regression_submodel[2][0]
                                                                 regression_submodel[3][0]
                                                                 regression_submodel[4][0]
__________________________________________________________________________________________________
classification (Concatenate)    (None, None, 3)      0           classification_submodel[0][0]
                                                                 classification_submodel[1][0]
                                                                 classification_submodel[2][0]
                                                                 classification_submodel[3][0]
                                                                 classification_submodel[4][0]
==================================================================================================
Total params: 36,424,447
Trainable params: 36,318,207
Non-trainable params: 106,240
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2020-10-06 13:15:17.712389: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-06 13:15:17.721698: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-06 13:15:17.749155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cupti64_101.dll
2020-10-06 13:15:17.855545: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
WARNING:tensorflow:From train_latest_fizyr.py:541: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/2
2020-10-06 13:15:25.776950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:28.004983: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-10-06 13:15:28.121843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-10-06 13:15:29.193580: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.209383: I tensorflow/stream_executor/cuda/cuda_driver.cc:775] failed to allocate 858.70M (900412160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 13:15:29.337869: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.363332: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.464090: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
...
2020-10-06 13:15:30.261915: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
  1/200 [..............................] - ETA: 0s - loss: 3.9458 - regression_loss: 2.8127 - classification_loss: 1.13312020-10-06 13:15:31.922292: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
WARNING:tensorflow:From C:\XXXXX\venv38\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
2020-10-06 13:15:32.542621: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-06 13:15:32.580191: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 3193 callback api events and 3193 activity events.
2020-10-06 13:15:32.695250: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.734889: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.trace.json.gz
2020-10-06 13:15:32.857585: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.874147: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.memory_profile.json.gz
2020-10-06 13:15:32.901109: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32Dumped tool data for xplane.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.xplane.pb
Dumped tool data for overview_page.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.kernel_stats.pb

  2/200 [..............................] - ETA: 1:41 - loss: 3.8811 - regression_loss: 2.7477 - classification_loss: 1.1334WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0590s vs `on_train_batch_end` time: 0.9618s). Check your callbacks.
Running network: 100% (165 of 165) |#########################################################################################################################################| Elapsed Time: 0:00:42 Time:  0:00:42
Parsing annotations: 100% (165 of 165) |#####################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100 instances of class object1 with average precision: 0.0000
97 instances of class object2 with average precision: 0.0000
15 instances of class object3 with average precision: 0.0000
mAP: 0.0000

If you have any suggestions for improving my precision, or for working out why no objects are being detected, please let me know.
