I have a Seldon deployment like this:
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
  - graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: gs://seldon-models/mlflow/elasticnet_wine
      name: classifier
    name: default
    replicas: 1
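The model is downloaded from the server successfully, but after a while the pod enters a crashloop state and restarts again and again.
When I look at the logs there is no error, because the logs start over with each restart; all I can see is the Python packages being downloaded.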
PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Downloading and Extracting Packages
_libgcc_mutex-0.1 | 3 KB | ########## | 100%
readline-7.0 | 324 KB | ########## | 100%
ncurses-6.2 | 817 KB | ########## | 100%
tbb4py-2020.0 | 209 KB | ########## | 100%
scipy-1.1.0 | 13.2 MB | ########## | 100%
zlib-1.2.11 | 103 KB | ########## | 100%
xz-5.2.5 | 341 KB | ########## | 100%
openssl-1.1.1g | 2.5 MB | ########## | 100%
mkl_fft-1.0.6 | 135 KB | ########## | 100%
blas-1.0 | 6 KB | ########## | 100%
pip-20.1.1 | 1.8 MB | ########## | 100%
wheel-0.34.2 | 51 KB | ########## | 100%
libffi-3.2.1 | 40 KB | ########## | 100%
scikit-learn-0.19.1 | 3.9 MB | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB | ########## | 100%
sqlite-3.32.3 | 1.1 MB | ########## | 100%
numpy-1.15.4 | 34 KB | ########## | 100%
tk-8.6.10 | 3.0 MB | ########## | 100%
libgcc-ng-9.1.0 | 5.1 MB | ########## | 100%
setuptools-47.3.1 | 514 KB | ########## | 100%
mkl_random-1.0.1 | 324 KB | ########## | 100%
python-3.6.9 | 30.2 MB | ########## | 100%
certifi-2020.6.20 | 156 KB | ########## | 100%
numpy-base-1.15.4 | 3.4 MB | ########## | 100%
intel-openmp-2019.4 | 729 KB | ########## | 100%
libedit-3.1.20191231 | 167 KB | ########## | 100%
libstdcxx-ng-9.1.0 | 3.1 MB | ########## | 100%
tbb-2020.0 | 1.1 MB | ########## | 100%
mkl-2018.0.3 | 126.9 MB | ######### | 91%
Now, trying the -p parameter suggested by @arghya-sadhu:
PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Downloading and Extracting Packages
scikit-learn-0.19.1 | 3.9 MB | ########## | 100%
ncurses-6.2 | 817 KB | ########## | 100%
_libgcc_mutex-0.1 | 3 KB | ########## | 100%
zlib-1.2.11 | 103 KB | ########## | 100%
tbb4py-2020.0 | 209 KB | ########## | 100%
setuptools-47.3.1 | 514 KB | ########## | 100%
libedit-3.1.20191231 | 167 KB | ########## | 100%
tbb-2020.0 | 1.1 MB | ########## | 100%
xz-5.2.5 | 341 KB | ########## | 100%
mkl_random-1.0.1 | 324 KB | ########## | 100%
libgcc-ng-9.1.0 | 5.1 MB | ########## | 100%
python-3.6.9 | 30.2 MB | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB | ########## | 100%
libffi-3.2.1 | 40 KB | ########## | 100%
mkl-2018.0.3 | 126.9 MB | ########## | 100%
libstdcxx-ng-9.1.0 | 3.1 MB | ########## | 100%
readline-7.0 | 324 KB | ########## | 100%
intel-openmp-2019.4 | 729 KB | ########## | 100%
tk-8.6.10 | 3.0 MB | ########## | 100%
pip-20.1.1 | 1.8 MB | ########## | 100%
numpy-base-1.15.4 | 3.4 MB | ########## | 100%
wheel-0.34.2 | 51 KB | ########## | 100%
scipy-1.1.0 | 13.2 MB | #########3 | 93%
And the description of the pod:
PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Name: model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace: default
Priority: 0
Node: mlops-control-plane/172.19.0.2
Start Time: Thu, 25 Jun 2020 10:08:20 +0200
Labels: app=model-a-wines-classifier-0-wines-classifier
fluentd=true
pod-template-hash=5b8bc7889d
seldon-app=model-a-wines-classifier
seldon-app-svc=model-a-wines-classifier-wines-classifier
seldon-deployment-id=model-a
version=wines-classifier
Annotations: prometheus.io/path: /prometheus
prometheus.io/scrape: true
Status: Running
IP: 10.244.0.17
IPs:
IP: 10.244.0.17
Controlled By: ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
wines-classifier-model-initializer:
Container ID: containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
Image: gcr.io/kfserving/storage-initializer:0.2.2
Image ID: gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
Port: <none>
Host Port: <none>
Args:
gs://seldon-models/mlflow/model-a
/mnt/models
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 25 Jun 2020 10:08:24 +0200
Finished: Thu, 25 Jun 2020 10:08:47 +0200
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 100Mi
Environment: <none>
Mounts:
/mnt/models from wines-classifier-provision-location (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
wines-classifier:
Container ID: containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
Image: seldonio/mlflowserver_rest:0.5
Image ID: docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
Ports: 6000/TCP, 9000/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Thu, 25 Jun 2020 10:23:28 +0200
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 25 Jun 2020 10:19:09 +0200
Finished: Thu, 25 Jun 2020 10:20:41 +0200
Ready: False
Restart Count: 7
Liveness: tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
Readiness: tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
Environment:
PREDICTIVE_UNIT_SERVICE_PORT: 9000
PREDICTIVE_UNIT_ID: wines-classifier
PREDICTIVE_UNIT_IMAGE: seldonio/mlflowserver_rest:0.5
PREDICTOR_ID: wines-classifier
PREDICTOR_LABELS: {"version":"wines-classifier"}
SELDON_DEPLOYMENT_ID: model-a
PREDICTIVE_UNIT_METRICS_SERVICE_PORT: 6000
PREDICTIVE_UNIT_METRICS_ENDPOINT: /prometheus
PREDICTIVE_UNIT_PARAMETERS: [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
Mounts:
/etc/podinfo from podinfo (rw)
/mnt/models from wines-classifier-provision-location (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
seldon-container-engine:
Container ID: containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
Image: docker.io/seldonio/seldon-core-executor:1.1.0
Image ID: docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
Ports: 8000/TCP, 8000/TCP, 5001/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
--sdep
model-a
--namespace
default
--predictor
wines-classifier
--http_port
8000
--grpc_port
5001
--transport
rest
--protocol
seldon
--prometheus_path
/prometheus
State: Running
Started: Thu, 25 Jun 2020 10:08:51 +0200
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Liveness: http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
Readiness: http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
Environment:
ENGINE_PREDICTOR: <binary ommited>
REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX: http://default-broker.
SELDON_LOG_MESSAGES_EXTERNALLY: false
Mounts:
/etc/podinfo from podinfo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.annotations -> annotations
wines-classifier-provision-location:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-6vqwk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6vqwk
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
Normal Pulled 15m kubelet, mlops-control-plane Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
Normal Created 15m kubelet, mlops-control-plane Created container wines-classifier-model-initializer
Normal Started 15m kubelet, mlops-control-plane Started container wines-classifier-model-initializer
Normal Pulled 15m kubelet, mlops-control-plane Container image "seldonio/mlflowserver_rest:0.5" already present on machine
Normal Created 15m kubelet, mlops-control-plane Created container wines-classifier
Normal Started 15m kubelet, mlops-control-plane Started container wines-classifier
Normal Pulled 15m kubelet, mlops-control-plane Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
Normal Created 14m kubelet, mlops-control-plane Created container seldon-container-engine
Normal Started 14m kubelet, mlops-control-plane Started container seldon-container-engine
Warning Unhealthy 14m (x8 over 14m) kubelet, mlops-control-plane Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
Warning Unhealthy 28s (x171 over 14m) kubelet, mlops-control-plane Readiness probe failed: HTTP probe failed with statuscode: 503
How can I disable the restarts, so that I can check the logs and see the actual error?
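One detail from the pod description that may help whoever answers: the last state of the wines-classifier container shows Exit Code 137. The sketch below decodes it, assuming the common convention that an exit code above 128 means 128 plus the number of the signal that ended the process. The helper name decode_exit_code is made up for illustration and is not part of kubectl or Seldon.

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe a container exit code using the 128 + signal convention.

    Hypothetical helper for illustration only.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name where the platform knows it.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # Exit code taken from the "Last State" section of the pod description.
    print(decode_exit_code(137))
```

On Linux, 137 maps to signal 9 (SIGKILL), which would suggest the container was killed from outside rather than failing with an error of its own. The exit code alone does not say what sent the signal.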