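I am currently using fizyr/retinanet to train a model that detects 3 classes. When I train the model, I get an average precision of 0.0000 for every class. In a few epochs I get a slightly higher value, for example 0.0007.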

I have looked at these threads, but their solutions do not seem to work: https://github.com/fizyr/keras-retinanet/issues/647 and https://github.com/fizyr/keras-retinanet/issues/1351

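Specifically, I added the --image-max-side argument to the training command and set it to 2560 pixels. The images I am using are 1920x2560 pixels. The training set has 916 images and the validation set has 258 images.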
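To make the effect of --image-max-side concrete, here is a minimal sketch of the resize rule as I understand it from the library's image utilities: the shorter side is scaled to the minimum side, and the scale is reduced only if the longer side would then exceed the maximum side. The function names and the default of 800 for the minimum side are my assumptions, so treat this as an illustration rather than the library's actual code.

```python
def compute_resize_scale(image_shape, min_side=800, max_side=1333):
    """Return the factor by which an image of shape (rows, cols, channels) is scaled.

    Assumed rule: scale the shorter side to `min_side`, then shrink further
    only if the longer side would end up larger than `max_side`.
    """
    rows, cols = image_shape[0], image_shape[1]
    scale = min_side / min(rows, cols)
    if max(rows, cols) * scale > max_side:
        scale = max_side / max(rows, cols)
    return scale


def resized_shape(image_shape, min_side=800, max_side=1333):
    """Return the (rows, cols) the network would see after resizing."""
    scale = compute_resize_scale(image_shape, min_side, max_side)
    return round(image_shape[0] * scale), round(image_shape[1] * scale)


# A 1920x2560 image with only the maximum side raised to 2560.
print(resized_shape((2560, 1920, 3), min_side=800, max_side=2560))
# The same image with the minimum side raised as well (hypothetical setting).
print(resized_shape((2560, 1920, 3), min_side=1920, max_side=2560))
```

If this rule is right, raising only the maximum side leaves my images at a scale of about 0.417, because the shorter side is still mapped to the minimum side; the full resolution would be used only if the minimum side were raised too.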


python train.py \
    --weights old_snapshots/resnet50_coco_best_v2.h5 \
    --backbone resnet50 \
    --batch-size 1 \
    --image-max-side 2560 \
    --epochs 50 \
    --steps 200 \
    --lr 1e-8 \
    --snapshot-path new_snapshots \
    --tensorboard-dir logs \
    --random-transform \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv
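For reference, the amount of data this command covers can be worked out from the flags above. This is plain arithmetic on the numbers in this post; the helper name is made up for illustration.

```python
def passes_over_dataset(num_images, batch_size, steps_per_epoch, epochs):
    """Return (samples per epoch, total samples, passes over the training set)."""
    per_epoch = batch_size * steps_per_epoch
    total = per_epoch * epochs
    return per_epoch, total, total / num_images


# 916 training images, --batch-size 1, --steps 200, --epochs 50
per_epoch, total, passes = passes_over_dataset(916, 1, 200, 50)
print(per_epoch, total, round(passes, 1))  # 200 10000 10.9
```

So each epoch draws 200 of the 916 training images, and the full run amounts to roughly 11 passes over the training set.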

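I have also tried running the command above without initializing the weights from COCO, which produces the same result. I have copied the train.py file into my parent directory.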

I had to include this extra code in train.py so that training would not stop because the GPU ran out of resources:

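import tensorflow as tf
# Allocate GPU memory on demand instead of reserving all of it at start-up.
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)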

Here is a sample from my train.csv file:


Here is my classes.csv file:
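Both files are meant to follow the CSV layout from the keras-retinanet README, which I understand to be `path,x1,y1,x2,y2,class_name` for each annotation row (with empty box fields for an image that has no objects) and `class_name,id` for each class row. Below is a minimal sketch for sanity-checking files in that layout. The helper names and the demo rows are hypothetical and are not taken from my actual data.

```python
import csv
import io


def load_classes(fileobj):
    """Read `class_name,id` rows into a {name: id} dict, skipping blank lines."""
    return {row[0]: int(row[1]) for row in csv.reader(fileobj) if row}


def find_annotation_problems(fileobj, classes):
    """Return a list of human-readable problems found in an annotations CSV."""
    problems = []
    for line_no, row in enumerate(csv.reader(fileobj), start=1):
        if len(row) != 6:
            problems.append(f"line {line_no}: expected 6 columns, got {len(row)}")
            continue
        path, x1, y1, x2, y2, class_name = row
        if (x1, y1, x2, y2, class_name) == ("", "", "", "", ""):
            continue  # image listed without any objects
        try:
            x1, y1, x2, y2 = (int(v) for v in (x1, y1, x2, y2))
        except ValueError:
            problems.append(f"line {line_no}: box coordinates are not integers")
            continue
        if x2 <= x1 or y2 <= y1:
            problems.append(f"line {line_no}: box has non-positive width or height")
        if class_name not in classes:
            problems.append(f"line {line_no}: unknown class {class_name!r}")
    return problems


# Demo with made-up rows; real use would pass open('train.csv', newline='').
demo_classes = load_classes(io.StringIO("object1,0\nobject2,1\nobject3,2\n"))
demo_rows = io.StringIO(
    "images/a.jpg,10,20,110,220,object1\n"
    "images/a.jpg,50,60,40,70,object2\n"   # x2 < x1
    "images/b.jpg,,,,,\n"                   # no objects
)
for problem in find_annotation_problems(demo_rows, demo_classes):
    print(problem)
```

A box with swapped corners, coordinates in the wrong order, or a class name that does not match classes.csv would show up here before any training time is spent.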


My setup is: Windows 10
TensorFlow 2.3.1
CUDA Toolkit 11.0
CuDNN v7.6.3

| NVIDIA-SMI 451.48       Driver Version: 451.48       CUDA Version: 11.0     |
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Quadro T2000       WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P8     5W /  N/A |    370MiB /  4096MiB |      2%      Default |


2020-10-06 13:15:11.249841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:13.280542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-10-06 13:15:13.326726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
2020-10-06 13:15:13.419482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Creating model, this may take a second...
2020-10-06 13:15:14.118835: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-06 13:15:14.142741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21262d9de70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.151023: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-06 13:15:14.157396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-10-06 13:15:14.169198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:14.208255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:14.214516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-06 13:15:14.783905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-06 13:15:14.791223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-10-06 13:15:14.797282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-10-06 13:15:14.801179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2905 MB memory) -> physical GPU (device: 0, name:
Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-10-06 13:15:14.818767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2120ce3da40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.825738: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro T2000, Compute Capability 7.5
Model: "retinanet"
Layer (type)                    Output Shape         Param #     Connected to
input_1 (InputLayer)            [(None, None, None,  0
conv1 (Conv2D)                  (None, None, None, 6 9408        input_1[0][0]
bn_conv1 (BatchNormalization)   (None, None, None, 6 256         conv1[0][0]
conv1_relu (Activation)         (None, None, None, 6 0           bn_conv1[0][0]
pool1 (MaxPooling2D)            (None, None, None, 6 0           conv1_relu[0][0]
res2a_branch2a (Conv2D)         (None, None, None, 6 4096        pool1[0][0]
bn2a_branch2a (BatchNormalizati (None, None, None, 6 256         res2a_branch2a[0][0]
res2a_branch2a_relu (Activation (None, None, None, 6 0           bn2a_branch2a[0][0]

P4_merged (Add)                 (None, None, None, 2 0           P5_upsampled[0][0]
P4_upsampled (UpsampleLike)     (None, None, None, 2 0           P4_merged[0][0]
C3_reduced (Conv2D)             (None, None, None, 2 131328      res3d_relu[0][0]
P6 (Conv2D)                     (None, None, None, 2 4718848     res5c_relu[0][0]
P3_merged (Add)                 (None, None, None, 2 0           P4_upsampled[0][0]
C6_relu (Activation)            (None, None, None, 2 0           P6[0][0]
P3 (Conv2D)                     (None, None, None, 2 590080      P3_merged[0][0]
P4 (Conv2D)                     (None, None, None, 2 590080      P4_merged[0][0]
P5 (Conv2D)                     (None, None, None, 2 590080      C5_reduced[0][0]
P7 (Conv2D)                     (None, None, None, 2 590080      C6_relu[0][0]
regression_submodel (Functional (None, None, 4)      2443300     P3[0][0]
classification_submodel (Functi (None, None, 3)      2422555     P3[0][0]
regression (Concatenate)        (None, None, 4)      0           regression_submodel[0][0]
classification (Concatenate)    (None, None, 3)      0           classification_submodel[0][0]
Total params: 36,424,447
Trainable params: 36,318,207
Non-trainable params: 106,240
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2020-10-06 13:15:17.712389: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-06 13:15:17.721698: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-06 13:15:17.749155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cupti64_101.dll
2020-10-06 13:15:17.855545: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
WARNING:tensorflow:From train_latest_fizyr.py:541: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/2
2020-10-06 13:15:25.776950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:28.004983: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-10-06 13:15:28.121843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-10-06 13:15:29.193580: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.209383: I tensorflow/stream_executor/cuda/cuda_driver.cc:775] failed to allocate 858.70M (900412160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 13:15:29.337869: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.363332: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.464090: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:30.261915: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
  1/200 [..............................] - ETA: 0s - loss: 3.9458 - regression_loss: 2.8127 - classification_loss: 1.1331
2020-10-06 13:15:31.922292: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
WARNING:tensorflow:From C:\XXXXX\venv38\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
2020-10-06 13:15:32.542621: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-06 13:15:32.580191: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 3193 callback api events and 3193 activity events.
2020-10-06 13:15:32.695250: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.734889: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.trace.json.gz
2020-10-06 13:15:32.857585: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.874147: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.memory_profile.json.gz
2020-10-06 13:15:32.901109: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
Dumped tool data for xplane.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.xplane.pb
Dumped tool data for overview_page.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.kernel_stats.pb

  2/200 [..............................] - ETA: 1:41 - loss: 3.8811 - regression_loss: 2.7477 - classification_loss: 1.1334
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0590s vs `on_train_batch_end` time: 0.9618s). Check your callbacks.
Running network: 100% (165 of 165) |#########################################################################################################################################| Elapsed Time: 0:00:42 Time:  0:00:42
Parsing annotations: 100% (165 of 165) |#####################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100 instances of class object1 with average precision: 0.0000
97 instances of class object2 with average precision: 0.0000
15 instances of class object3 with average precision: 0.0000
mAP: 0.0000
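For context on what these 0.0000 figures mean, here is a minimal sketch of a VOC-style average precision calculation (area under the monotone precision envelope over recall). This is my understanding of the kind of metric the evaluation reports, not the library's own code. The function name and its inputs are hypothetical, and deciding which detections count as true positives (normally an IoU threshold against the annotations) is assumed to have happened already.

```python
def average_precision(scores, is_true_positive, num_annotations):
    """Compute VOC-style AP from per-detection scores and true-positive flags."""
    # Visit detections from the most to the least confident.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recall, precision = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recall.append(tp / num_annotations)
        precision.append(tp / (tp + fp))

    # Pad the curve, then make precision non-increasing from right to left.
    mrec = [0.0] + recall + [1.0]
    mpre = [0.0] + precision + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Sum the rectangle areas wherever recall increases.
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


# No detection matches any of the 100 annotations: AP is exactly zero.
print(average_precision([0.9, 0.8, 0.3], [False, False, False], 100))
```

Under this definition, an AP of exactly 0.0000 for a class means that none of the detections kept for evaluation was matched to an annotation of that class, and a value like 0.0007 means only a handful were.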



