tensorflow - Retraining the ssd_inception_v2 model succeeds with a single class but fails with two classes after a certain number of steps

Question

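My task is to train an object detection model to detect a single class (person) and convert it to an Edge TPU model to run on a Raspberry Pi with a Coral attached. I first used the ssd_mobilenet_v2_quantized_coco model and then the ssd_inception_v2_coco model (both from the TensorFlow model zoo) as the base model for retraining. Both were retrained, converted, and deployed on the Raspberry Pi, where they ran detection on images successfully.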

Environment: Ubuntu 18.04 and TensorFlow 1.14 (tensorflow:1.14.0-gpu-py3), running in Docker.

When a second class (computer) was required, I tried retraining ssd_inception_v2 as the base model with two classes (person and computer). Observations:

model-ckpt-18000 (the checkpoint created after 18,000 steps) was converted to an Edge TPU model successfully:

Edge TPU Compiler version 15.0.340273435

Model compiled successfully in 1755 ms.

Input model: <input_model>
Input size: 12.80MiB
Output model: <output_model>
Output size: 14.46MiB
On-chip memory used for caching model parameters: 5.51MiB
On-chip memory remaining for caching model parameters: 1.00KiB
Off-chip memory used for streaming uncached model parameters: 8.94MiB
Number of Edge TPU subgraphs: 1
Total number of operations: 129
Operation log: <log_file>

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 128
Number of operations that will run on CPU: 1
See the operation log file for individual operation details.

But model-ckpt-20000 (the checkpoint created after 20,000 steps) failed with the following error:

Edge TPU Compiler version 15.0.340273435

ERROR: :309 scale_diff / output_scale <= 0.02 was not true.
ERROR: Node number 89 (CONV_2D) failed to prepare.

ERROR: :309 scale_diff / output_scale <= 0.02 was not true.
ERROR: Node number 89 (CONV_2D) failed to prepare.

Compilation failed: Internal error

Internal compiler error. Aborting!
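
For context, the `scale_diff / output_scale <= 0.02` message appears to come from the consistency check that TensorFlow Lite applies when preparing a quantized convolution: the bias scale is expected to be close to the product of the input and filter scales, measured relative to the output scale. The sketch below reproduces that arithmetic in plain Python, assuming this reading of the check is right. The function name and the sample scale values are made up for illustration and are not taken from the failing model.

```python
def conv_scale_check(input_scale, filter_scale, bias_scale, output_scale, tol=0.02):
    """Return (ok, ratio) for the assumed quantized CONV_2D scale check.

    Assumption: the check compares |input_scale * filter_scale - bias_scale|
    against the output scale and requires the ratio to stay within `tol`.
    """
    input_product_scale = input_scale * filter_scale
    scale_diff = abs(input_product_scale - bias_scale)
    ratio = scale_diff / output_scale
    return ratio <= tol, ratio


# Hypothetical scales where the bias scale matches input_scale * filter_scale.
good_ok, good_ratio = conv_scale_check(0.0078125, 0.02, 0.00015625, 0.05)

# Hypothetical scales where the bias scale has drifted away from the product.
bad_ok, bad_ratio = conv_scale_check(0.0078125, 0.02, 0.002, 0.05)

print(good_ok, good_ratio)
print(bad_ok, bad_ratio)
```

If this reading is correct, the failure at node 89 would mean that the quantization ranges recorded for that convolution's input, weights, bias, and output are no longer mutually consistent in the 20,000-step checkpoint.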

I used the same steps to compile the ssd_mobilenet_v2_quantized_coco model, and that process succeeded without any errors.

Steps involved:

Retrain the model:

python3 model_main.py \
--model_dir=<model_dir> \
--pipeline_config_path=<pipeline_config_path>

If necessary, I can share the pipeline.config files for ssd_mobilenet_v2_quantized_coco and ssd_inception_v2_coco. I left them out because the post is already long.

Convert the checkpoint to a frozen graph:

python3 export_tflite_ssd_graph.py \
--pipeline_config_path=<pipeline_config_path> \
--trained_checkpoint_prefix=<trained_checkpoint_prefix> \
--output_directory=<output_directory> \
--add_postprocessing_op=true

Convert the frozen graph to tflite:

tflite_convert \
--output_file=<output_file> \
--graph_def_file=<graph_def_file> \
--inference_type=QUANTIZED_UINT8 \
--input_arrays=normalized_input_image_tensor \
--output_arrays=TFLite_Detection_PostProcess,TFLite_Detection_PostProcess:1,TFLite_Detection_PostProcess:2,TFLite_Detection_PostProcess:3 \
--mean_values=128 \
--std_dev_values=128 \
--input_shapes=1,300,300,3 \
--change_concat_input_ranges=false \
--allow_nudging_weights_to_use_fast_gemm_kernel=true \
--allow_custom_ops \
--default_ranges_min=0 \
--default_ranges_max=255
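
The `--mean_values=128` and `--std_dev_values=128` flags tell the converter how uint8 input pixels map to the float values the graph expects, using real = (quantized - mean) / std_dev. As a quick sanity check of the input range this implies, here is a minimal sketch; the helper name is hypothetical.

```python
def dequantize_input(q, mean=128.0, std_dev=128.0):
    """Map a uint8 input value to the float value it represents."""
    return (q - mean) / std_dev


lowest = dequantize_input(0)     # -1.0
highest = dequantize_input(255)  # 0.9921875

print(lowest, highest)
```

With these settings the uint8 range 0..255 covers roughly [-1, 1), which should match the preprocessing the pipeline.config applies during training.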

Compile the tflite model:

edgetpu_compiler \
-o <output_folder> \
<tflite_file>
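
Because one checkpoint compiles and a later one does not, it may help to run the last three steps over each saved checkpoint and note where compilation first breaks. The following is a rough sketch of such a loop, not a tested tool. It reuses only the flags shown above. The output file names (`tflite_graph.pb` is what I would expect export_tflite_ssd_graph.py to write, but treat that as an assumption), the directory layout, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(pipeline_config, ckpt_prefix, out_dir):
    """Build the export, convert, and compile commands for one checkpoint."""
    out_dir = Path(out_dir)
    graph_def = out_dir / "tflite_graph.pb"  # assumed export output name
    tflite_file = out_dir / "model.tflite"   # hypothetical file name
    export = [
        "python3", "export_tflite_ssd_graph.py",
        f"--pipeline_config_path={pipeline_config}",
        f"--trained_checkpoint_prefix={ckpt_prefix}",
        f"--output_directory={out_dir}",
        "--add_postprocessing_op=true",
    ]
    convert = [
        "tflite_convert",
        f"--output_file={tflite_file}",
        f"--graph_def_file={graph_def}",
        "--inference_type=QUANTIZED_UINT8",
        "--input_arrays=normalized_input_image_tensor",
        "--output_arrays=TFLite_Detection_PostProcess,"
        "TFLite_Detection_PostProcess:1,"
        "TFLite_Detection_PostProcess:2,"
        "TFLite_Detection_PostProcess:3",
        "--mean_values=128",
        "--std_dev_values=128",
        "--input_shapes=1,300,300,3",
        "--change_concat_input_ranges=false",
        "--allow_nudging_weights_to_use_fast_gemm_kernel=true",
        "--allow_custom_ops",
        "--default_ranges_min=0",
        "--default_ranges_max=255",
    ]
    compile_cmd = ["edgetpu_compiler", "-o", str(out_dir), str(tflite_file)]
    return [export, convert, compile_cmd]


def first_failing_step(commands, run=subprocess.run):
    """Run commands in order; return the program name of the first failure."""
    for cmd in commands:
        result = run(cmd)
        if result.returncode != 0:
            return cmd[0]
    return None


def check_checkpoints(pipeline_config, ckpt_prefixes, work_dir):
    """Map each checkpoint prefix to its first failing step (None if all pass)."""
    report = {}
    for index, prefix in enumerate(ckpt_prefixes):
        out_dir = Path(work_dir) / f"ckpt_{index}"
        out_dir.mkdir(parents=True, exist_ok=True)
        commands = build_commands(pipeline_config, prefix, out_dir)
        report[prefix] = first_failing_step(commands)
    return report
```

Calling `check_checkpoints` with the pipeline config and a list of checkpoint prefixes between the 18,000-step and 20,000-step checkpoints would show which is the first one that `edgetpu_compiler` rejects, assuming intermediate checkpoints were saved.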

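I searched for similar errors, but most posts are unrelated to this scenario. With ssd_inception_v2_coco, training for a single class works, but training for two classes produces the error. Could someone point out what is wrong in the process? Please let me know if more information is needed.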
