python - Unable to train my TensorFlow detector model in Google Cloud

Question

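I'm trying to train my own detector model based on the TensorFlow sample and this post. I did succeed in training locally on a MacBook Pro. The problem is that I don't have a GPU, and running it on the CPU is too slow (about 25 seconds per iteration).

So I tried to follow the tutorial to run it on Google Cloud ML Engine, but I can't get it to work.

My folder structure is as follows: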

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

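The steps I took to switch from local training to Google Cloud training were:

Create a bucket in Google Cloud Storage and copy my local folder structure and files to it;
Edit my pipeline.config file and change all paths from Users/dev/detector/ to gs://bucketname/;
Create a YAML file with the default configuration provided in the tutorial;
Run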

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config
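Step 2 above (rewriting the paths in pipeline.config) can also be scripted instead of edited by hand. The following is a minimal sketch, not something taken from the tutorial: the helper name rewrite_paths is hypothetical, the sample line is only shaped like a pipeline.config entry, and the leading slash on the local prefix is an assumption based on the --config path used in the command.

```python
def rewrite_paths(config_text, local_prefix, bucket_prefix):
    """Return config_text with every occurrence of local_prefix replaced by bucket_prefix."""
    return config_text.replace(local_prefix, bucket_prefix)


# Illustrative input only: one line shaped like a pipeline.config path entry.
sample = 'input_path: "/Users/dev/detector/data/train.record"'

# Hypothetical prefixes matching the question's local folder and bucket name.
converted = rewrite_paths(sample, "/Users/dev/detector/", "gs://bucketname/")
print(converted)
```

A plain string replacement like this only works if every path in the file starts with exactly the same local prefix, so it is worth checking the result for leftover local paths before uploading it.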

Doing so gives me the following error message from the MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.

Thanks in advance.

score 0 · Accepted Answer

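Check the solution posted by andersskog here. It worked for me. I made a patch here. For a manual fix, follow these instructions:

Make sure the runtimeVersion in your YAML file is 1.4, for example: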

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
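Before submitting a job, it can be handy to confirm that the config file really pins the runtime version. The following is a rough sketch under stated assumptions: read_runtime_version is a hypothetical helper, it uses a simple regular expression rather than a real YAML parser, and it only handles a runtimeVersion entry written on a single line as in the example above.

```python
import re


def read_runtime_version(yaml_text):
    """Return the runtimeVersion value found in yaml_text, or None if it is absent."""
    match = re.search(
        r'^\s*runtimeVersion:\s*["\']?([^"\'\s]+)["\']?\s*$',
        yaml_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


# A shortened copy of the config shown above.
cloud_yml = """trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
"""

version = read_runtime_version(cloud_yml)
print(version)
```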

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install

class CustomCommands(install):
    """Install command that runs apt-get before the regular install step."""

    def RunCustomCommand(self, command_list):
        # Run the command, capturing stdout and stderr together for the log.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # Install python-tk first, then continue with the normal install.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(
            ['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)

REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)

At line 24 of object_detection/utils/visualization_utils.py (before import matplotlib.pyplot as plt), add:

import matplotlib
matplotlib.use('agg')

At line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, at line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()
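The two edits above swap one function location for the other by hand. An alternative that avoids hard-coding either location is to resolve the function at run time. This is a hedged sketch and not what the patch mentioned in this answer does: resolve_global_step_fn is a hypothetical helper, and the fake_tf object below is a stand-in that only mimics the attribute layout of the tensorflow module so the example runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_global_step_fn(tf_module):
    """Return get_or_create_global_step from tf.train if present, else from tf.contrib.framework."""
    train_namespace = getattr(tf_module, "train", None)
    train_fn = getattr(train_namespace, "get_or_create_global_step", None)
    if train_fn is not None:
        return train_fn
    return tf_module.contrib.framework.get_or_create_global_step


# Stand-in for a TensorFlow build that only exposes the contrib function.
# This is NOT real TensorFlow; it exists purely to exercise the helper.
fake_tf = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(
            get_or_create_global_step=lambda: "contrib global step")),
)

get_step = resolve_global_step_fn(fake_tf)
print(get_step())
```

With the real library, the call would be resolve_global_step_fn(tf) after import tensorflow as tf; whether that is preferable to editing the two files directly depends on how much of the library you are willing to patch.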

Hope this helps!

score 0 · Accepted Answer

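The problem is the protobuf version. You have probably installed the latest protoc via brew, and protobuf has added the file field since version 3.5.0: https://github.com/google/protobuf/blob/9f80df026933901883da1d556b38292e14836612/CHANGES.txt#L74

So, on top of the changes above, set the protobuf version in REQUIRED_PACKAGES to 'protobuf>=3.5.1'.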
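For reference, this is roughly what the adjusted requirement list looks like, together with a small check that a given protobuf version meets the new minimum. It is a sketch under stated assumptions: version_tuple and protobuf_is_new_enough are hypothetical helpers, and the comparison only handles purely numeric dotted versions such as 3.5.1.

```python
# REQUIRED_PACKAGES from setup.py with only the protobuf bound raised.
REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.5.1', 'Matplotlib>=2.1']


def version_tuple(version):
    """Turn a numeric dotted version such as '3.5.1' into (3, 5, 1)."""
    return tuple(int(part) for part in version.split("."))


def protobuf_is_new_enough(installed, minimum="3.5.1"):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print(protobuf_is_new_enough("3.3.0"))
print(protobuf_is_new_enough("3.5.1"))
```

The installed version can usually be read from google.protobuf.__version__ and passed to the helper.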
