I have a Seldon deployment like this:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1     

The model is downloaded from the server successfully, but after a while the pod enters a crash loop and restarts again and again.

When I look at the logs there is no error, because the log starts over on each restart; all I can see is the Python packages being downloaded.

PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
scipy-1.1.0          | 13.2 MB   | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
openssl-1.1.1g       | 2.5 MB    | ########## | 100%
mkl_fft-1.0.6        | 135 KB    | ########## | 100%
blas-1.0             | 6 KB      | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
sqlite-3.32.3        | 1.1 MB    | ########## | 100%
numpy-1.15.4         | 34 KB     | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
certifi-2020.6.20    | 156 KB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | #########  |  91%

Now, trying the -p flag suggested by @arghya-sadhu:

PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
scipy-1.1.0          | 13.2 MB   | #########3 |  93%

And the description of the pod:

PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Name:         model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace:    default
Priority:     0
Node:         mlops-control-plane/172.19.0.2
Start Time:   Thu, 25 Jun 2020 10:08:20 +0200
Labels:       app=model-a-wines-classifier-0-wines-classifier
              fluentd=true
              pod-template-hash=5b8bc7889d
              seldon-app=model-a-wines-classifier
              seldon-app-svc=model-a-wines-classifier-wines-classifier
              seldon-deployment-id=model-a
              version=wines-classifier
Annotations:  prometheus.io/path: /prometheus
              prometheus.io/scrape: true
Status:       Running
IP:           10.244.0.17
IPs:
  IP:           10.244.0.17
Controlled By:  ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
  wines-classifier-model-initializer:
    Container ID:  containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://seldon-models/mlflow/model-a
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Jun 2020 10:08:24 +0200
      Finished:     Thu, 25 Jun 2020 10:08:47 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from wines-classifier-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
  wines-classifier:
    Container ID:   containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
    Image:          seldonio/mlflowserver_rest:0.5
    Image ID:       docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
    Ports:          6000/TCP, 9000/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 25 Jun 2020 10:23:28 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Jun 2020 10:19:09 +0200
      Finished:     Thu, 25 Jun 2020 10:20:41 +0200
    Ready:          False
    Restart Count:  7
    Liveness:       tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      PREDICTIVE_UNIT_SERVICE_PORT:          9000
      PREDICTIVE_UNIT_ID:                    wines-classifier
      PREDICTIVE_UNIT_IMAGE:                 seldonio/mlflowserver_rest:0.5
      PREDICTOR_ID:                          wines-classifier
      PREDICTOR_LABELS:                      {"version":"wines-classifier"}
      SELDON_DEPLOYMENT_ID:                  model-a
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
      PREDICTIVE_UNIT_PARAMETERS:            [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
    Mounts:
      /etc/podinfo from podinfo (rw)
      /mnt/models from wines-classifier-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
  seldon-container-engine:
    Container ID:  containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
    Image:         docker.io/seldonio/seldon-core-executor:1.1.0
    Image ID:      docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
    Ports:         8000/TCP, 8000/TCP, 5001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --sdep
      model-a
      --namespace
      default
      --predictor
      wines-classifier
      --http_port
      8000
      --grpc_port
      5001
      --transport
      rest
      --protocol
      seldon
      --prometheus_path
      /prometheus
    State:          Running
      Started:      Thu, 25 Jun 2020 10:08:51 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:  <binary ommited>
      REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX:  http://default-broker.
      SELDON_LOG_MESSAGES_EXTERNALLY:          false
    Mounts:
      /etc/podinfo from podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  wines-classifier-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6vqwk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6vqwk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                          Message
  ----     ------     ----                 ----                          -------
  Normal   Scheduled  <unknown>            default-scheduler             Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier-model-initializer
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier-model-initializer
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "seldonio/mlflowserver_rest:0.5" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
  Normal   Created    14m                  kubelet, mlops-control-plane  Created container seldon-container-engine
  Normal   Started    14m                  kubelet, mlops-control-plane  Started container seldon-container-engine
  Warning  Unhealthy  14m (x8 over 14m)    kubelet, mlops-control-plane  Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
  Warning  Unhealthy  28s (x171 over 14m)  kubelet, mlops-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

How can I disable the restarts so that I can inspect the logs and see the actual error?

2 Answers

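The default liveness and readiness probe timings are probably too short for the classifier container to finish installing its dependencies. Kubernetes kills and restarts the container before it has finished starting because it fails the liveness probe (a failed readiness probe only keeps the pod out of service; it does not trigger a restart by itself).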

In my case, I had to add the following to the Seldon deployment manifest to give the container more time (adjust the values as needed):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: ...
spec:
  name: ...
  predictors:
    - graph:
        ...
      name: ...
      replicas: ...
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
                livenessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
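
As a rough sketch of what these values change: assuming the first probe runs at initialDelaySeconds and one more runs every periodSeconds, the kubelet gives up after failureThreshold consecutive failures. The helper name liveness_deadline is hypothetical, and the estimate ignores kubelet scheduling jitter and the termination grace period.

```shell
# Approximate number of seconds after container start at which the
# liveness probe has failed failureThreshold times in a row.
# Arguments: initialDelaySeconds periodSeconds failureThreshold
liveness_deadline() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $(( initial_delay + (failure_threshold - 1) * period ))
}

liveness_deadline 60 5 3      # probe settings shown in the question: 70
liveness_deadline 120 30 10   # values from the manifest above: 390
```

With the settings from the question, the container has about 70 seconds to open port 9000, which is not enough to download and install the Conda environment. The manifest above raises that to roughly 390 seconds.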

Answered 2021-03-03T15:48:51.020

Use the -p flag, as in the example command below, to view the logs of the previously terminated container (here named ruby) in a pod (here named web-1):

kubectl logs -p -c ruby web-1

Check the events with kubectl get events.

Use kubectl describe pod podname to check what may be causing the crash loop.
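
One thing worth reading from the describe output is the exit code of the last terminated container; in the question it is 137. By shell convention, an exit code above 128 means the process was terminated by a signal whose number is the code minus 128. The sketch below decodes that. The function names are hypothetical, and last_exit_code is only defined, not called, because it needs access to a cluster.

```shell
# Print the signal number encoded in an exit code, or 0 if the
# process exited on its own (exit code 128 or lower).
signal_from_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

# Fetch the exit code of a container's last terminated state.
# Arguments: pod name, container name.
last_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath="{.status.containerStatuses[?(@.name==\"$2\")].lastState.terminated.exitCode}"
}

signal_from_exit_code 137   # prints 9 (SIGKILL)
```

Signal 9 is SIGKILL, so the container was killed from outside rather than exiting with an error of its own. That would explain why the previous logs stop in the middle of a download with no error message.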

Answered 2020-06-25T07:47:21.370