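I have a model that was trained on Machine Learning Compute in Azure Machine Learning service. The registered model already exists in my workspace, and I would like to deploy it to a pre-existing AKS instance that I previously provisioned in that workspace. I am able to retrieve the registered models and write the Conda environment file without problems: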

# retrieve cloud representations of the models
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)

<azureml.core.model.Model object at 0x7f66ab3b1f98>
<azureml.core.model.Model object at 0x7f66ab7e49b0>
<azureml.core.model.Model object at 0x7f66ab85e710>

package_list = [
  'category-encoders==1.3.0',
  'numpy==1.15.0',
  'pandas==0.24.1',
  'scikit-learn==0.20.2']

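# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)
# Write to a plain local path: open() does not understand a 'file:' URI prefix,
# and the image configuration below expects 'myenv.yml' in the working directory
conda_yml = os.path.join(os.getcwd(), 'myenv.yml')

with open(conda_yml, "w") as f:
    f.write(myenv.serialize_to_string())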

Configuring and registering the image works:

# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py', 
                                                  runtime='python', 
                                                  conda_file='myenv.yml',
                                                  description='Pumps Random Forest model')


# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME, 
                              models = [rf, le, ohc],
                              image_config = image_config,
                              workspace = ws)

Creating image
Running....................
SucceededImage creation operation finished for image pumpsrfimage:2, operation "Succeeded"

Attaching to the existing cluster also works:

# Attach the cluster to your workgroup
attach_config = AksCompute.attach_configuration(resource_group = Config.RESOURCE_GROUP,
                                                cluster_name = Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws, 
                                  name=Config.DEPLOY_COMPUTE, 
                                  attach_configuration=attach_config)

# Wait for the operation to complete
aks_target.wait_for_completion(True)

SucceededProvisioning operation finished, operation "Succeeded"

However, when I try to deploy the image to the existing cluster, it fails with a WebserviceException.

# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()

# Deploy from image
service = Webservice.deploy_from_image(workspace = ws,
                                       name = 'pumps-aks-service-1' ,
                                       image = image,
                                       deployment_config = aks_config,
                                       deployment_target = aks_target)
# Wait for the deployment to complete
service.wait_for_deployment(show_output = True)
print(service.state)

WebserviceException: Unable to create service with image pumpsrfimage:1 in non "Succeeded" creation state.
---------------------------------------------------------------------------
WebserviceException                       Traceback (most recent call last)
<command-201219424688503> in <module>()
      7                                        image = image,
      8                                        deployment_config = aks_config,
----> 9                                        deployment_target = aks_target)
     10 # Wait for the deployment to complete
     11 service.wait_for_deployment(show_output = True)

/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/webservice.py in deploy_from_image(workspace, name, image, deployment_config, deployment_target)
    284                         return child._deploy(workspace, name, image, deployment_config, deployment_target)
    285 
--> 286         return deployment_config._webservice_type._deploy(workspace, name, image, deployment_config, deployment_target)
    287 
    288     @staticmethod

/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/aks.py in _deploy(workspace, name, image, deployment_config, deployment_target)

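Any ideas on how to solve this issue? I am writing the code in a Databricks notebook. Also, I am able to create the cluster and deploy to it from the Azure Portal without any problem, so this appears to be an issue with my code, the Python SDK, or the way Databricks works with AMLS.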

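Update: I was able to deploy my image to AKS using the Azure Portal, and the web service works as expected. This means the issue lies somewhere between Databricks, the azureml Python SDK, and the Machine Learning service.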

Update 2: I am working with Microsoft to fix this issue. I will report back once we have a solution.

3 Answers

In my initial code, when creating the image, I was not using:

image.wait_for_creation(show_output=True)

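As a result, I was calling CreateImage and then DeployImage straight away, before the image had actually finished being created, which produced the error. Can't believe it was that simple.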

Updated image creation snippet:

# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME, 
                              models = [rf, le, ohc],
                              image_config = image_config,
                              workspace = ws)

image.wait_for_creation(show_output=True)
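
To illustrate why the wait matters, here is a minimal sketch of the same idea in plain Python: poll a state until it reads "Succeeded" and only then move on to deployment. The helper name wait_until_succeeded and the get_state callable are hypothetical and not part of the azureml SDK. The "Succeeded" and "Running" state names come from the logs above, while "Failed" is an assumed name for the failure state. With the real SDK, image.wait_for_creation() already does this for you.

```python
import time


def wait_until_succeeded(get_state, timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_state() until it returns 'Succeeded'.

    Raises RuntimeError on 'Failed' (assumed failure state name)
    and TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "Succeeded":
            return state
        if state == "Failed":
            raise RuntimeError("image creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for a real image object: reports 'Running' twice, then 'Succeeded'.
states = iter(["Running", "Running", "Succeeded"])
final_state = wait_until_succeeded(lambda: next(states), poll_s=0.0)
print(final_state)
```

Deploying only after such a check returns avoids handing the service an image that is still in a non-"Succeeded" creation state.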

Answered 2019-03-07T20:22:05.420

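From personal experience, I would say the error message you are seeing may indicate that there is an error in the script inside the image. Such an error does not necessarily prevent the image from being created successfully, but it may prevent the image from being used in a service. However, if you have already deployed the image successfully to another service, you should be able to rule this option out.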

You can follow this guide to learn more about how to debug the Docker image locally and where to find logs and other useful information.
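
Before building an image at all, a quick way to catch script errors is to call the entry script's functions directly. The sketch below assumes the usual init() / run(raw_data) layout of an Azure ML scoring script. The placeholder model, the {"data": ...} payload shape, and the smoke_test helper are hypothetical stand-ins, so swap in the functions from your own score.py.

```python
import json

# Hypothetical stand-ins for the init() and run() functions in score.py.
model = None


def init():
    """Load the model once; here a placeholder that sums each input row."""
    global model
    model = lambda rows: [sum(row) for row in rows]


def run(raw_data):
    """Score a JSON request and return a JSON response (or a JSON error)."""
    try:
        data = json.loads(raw_data)["data"]
        return json.dumps({"result": model(data)})
    except Exception as exc:
        return json.dumps({"error": str(exc)})


def smoke_test(init_fn, run_fn, payload):
    """Call init then run with a sample payload; return (ok, decoded response)."""
    init_fn()
    response = json.loads(run_fn(json.dumps(payload)))
    return "error" not in response, response


ok, response = smoke_test(init, run, {"data": [[1, 2], [3, 4]]})
print(ok, response)
```

If this fails locally, the same failure would likely show up only after the image is built and deployed, where it is much slower to diagnose.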

Answered 2019-02-21T15:56:24.707

I agree with Arvid's answer. Were you able to run it successfully? You can also try deploying it to ACI; if the problem is in score.py you will hit the same issue, but it is quick to try. Alternatively, and a bit more painfully, if you want to debug the deployment you can expose TCP port 5678 on your local Docker deployment and use VS Code with PTVSD to connect to it and debug step by step.
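
For the PTVSD route, a rough sketch of what could go at the top of score.py is shown below. It assumes ptvsd 4.x is installed in the image and that port 5678 is published by the local container. The REMOTE_DEBUG environment variable and the maybe_enable_debugger helper are made-up names used here only so the listener stays off by default.

```python
import os


def maybe_enable_debugger(port=5678, wait=False):
    """Start a ptvsd listener when REMOTE_DEBUG=1; return True if it was started."""
    if os.environ.get("REMOTE_DEBUG") != "1":
        return False
    try:
        import ptvsd  # assumed to be installed in the image (ptvsd 4.x)
    except ImportError:
        return False
    # Listen on all interfaces so the port published by Docker is reachable.
    ptvsd.enable_attach(address=("0.0.0.0", port))
    if wait:
        ptvsd.wait_for_attach()  # block until VS Code attaches
    return True


started = maybe_enable_debugger()
print("debugger listening:", started)
```

Calling the helper with wait=True from init() would pause the service until VS Code attaches, which makes it possible to step through model loading.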

Answered 2019-03-04T13:54:29.257