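We have successfully deployed the following setup for our image-processing application on Azure (AKS):
- An AKS cluster with 1 GPU node (which needs to scale with incoming traffic)
- 1 pod running a TensorFlow model on the GPU node (at most 1 pod per node because of memory constraints)
- Prometheus scraping GPU utilization metrics (NVIDIA's DCGM exporter)
- A KEDA ScaledObject that drives the Horizontal Pod Autoscaler (HPA), in the same namespace as our deployment
- Query: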
ceil(avg_over_time(DCGM_FI_DEV_GPU_UTIL{namespace="myproject"}[2m]))
The deployment is based on:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
https://keda.sh/docs/1.4/scalers/prometheus/
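With this setup, the pods are autoscaled (horizontally) on the DCGM GPU utilization metric, from 1 pod to 2 pods. The cluster-autoscaler picks this up and scales the number of GPU nodes in the cluster from 1 to 2. The newly desired pod is scheduled on the new node, and the average GPU utilization drops. However, once the new node has been added and the second pod has been assigned to it, the KEDA HPA object can no longer fetch the external GPU metric. As a result, the HPA object stops working and cannot scale the pods back down, so the number of pods (and nodes) stays at 2.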
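For reference, here is a rough Python sketch of how, to our understanding, the HPA turns an external metric with an average-value target into a replica count. The function name, the 0.1 tolerance, and the sample metric totals are illustrative assumptions; the target of 60 and the replica bounds of 1 and 2 come from our HPA. This is only an approximation of the upstream algorithm, not the controller's actual code.

```python
import math


def desired_replicas(total_metric, target_average, current_replicas,
                     min_replicas=1, max_replicas=2, tolerance=0.1):
    """Approximate replica count for an external metric with an average-value target.

    total_metric is the raw value returned by the external metrics API;
    target_average is the per-pod target (60 in our HPA).
    """
    usage_ratio = total_metric / (target_average * current_replicas)
    if abs(1.0 - usage_ratio) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(total_metric / target_average)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(82, 60, 2))  # 2: still above one pod's worth of load
print(desired_replicas(41, 60, 2))  # 1: low enough to scale back down
```

The point of the sketch is that every branch needs a metric value as input. When the external metric cannot be fetched at all, the HPA has nothing to compute from, which matches the replica count staying at 2.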
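All pods and services on both nodes appear healthy. The DCGM exporter is also running on the new node, so metrics from that node should be scrapable.
Does anyone have experience with this, or know how to debug it? The output of describing the HPA is shown below.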
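If we use a metric other than DCGM, for example http_request_total, we are able to get metrics from all nodes, so we suspect the error is in the DCGM part, which is the part we need for GPU metrics. We have installed DCGM (dcgm-exporter) in the namespace and also configured it in the Prometheus additionalScrapeConfig section.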
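One check that might narrow this down is to count how many series the query returns once the second node is up. As far as we understand, KEDA's Prometheus scaler expects the query to yield a single element, so a result with one series per GPU node could be a problem; we have not confirmed that this is the cause here. The sketch below parses the JSON body returned by Prometheus's /api/v1/query endpoint and counts series per node. The function name, the Hostname label, and the sample payload are assumptions for illustration, not output from our cluster.

```python
import json
from collections import Counter


def series_per_label(response_text, label="Hostname"):
    """Count the series in a Prometheus instant-query response, grouped by one label."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %s" % payload.get("error"))
    result = payload["data"]["result"]
    return Counter(sample["metric"].get(label, "<missing>") for sample in result)


# Hypothetical response for the query above with two GPU nodes (made-up values).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "gpu-node-1", "gpu": "0"}, "value": [1619091057, "52"]},
            {"metric": {"Hostname": "gpu-node-2", "gpu": "0"}, "value": [1619091057, "30"]},
        ],
    },
})

counts = series_per_label(SAMPLE)
print(dict(counts))
print("series returned:", sum(counts.values()))
```

If the real response contains more than one series, wrapping the query in an aggregation such as avg(...) would collapse it to a single value; whether that resolves the HPA error is something we would still need to test.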
If you need any additional information to help, please let me know! Thanks in advance.
kubectl describe hpa keda-hpa-prometheus-scaled-object -n myproject
Name:               keda-hpa-prometheus-scaled-object
Namespace:          myproject
Labels:             app.kubernetes.io/managed-by=keda-operator
                    app.kubernetes.io/name=keda-hpa-prometheus-scaled-object
                    app.kubernetes.io/part-of=prometheus-scaled-object
                    app.kubernetes.io/version=2.0.0
                    deploymentName=myproject-deployment
                    scaledObjectName=prometheus-scaled-object
Annotations:        <none>
CreationTimestamp:  Thu, 22 Apr 2021 13:30:57 +0200
Reference:          Deployment/myproject-deployment
Metrics:            ( current / target )
  "prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL" (target average value):  41 / 60
Min replicas:       1
Max replicas:       2
Deployment pods:    2 current / 2 desired
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Warning  FailedComputeMetricsReplicas  45m (x12 over 47m)     horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL external metric: unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util
  Warning  FailedGetExternalMetric       2m55s (x178 over 47m)  horizontal-pod-autoscaler  unable to get external metric myproject/prometheus-http---XX-X-XXX-XXX-9090--DCGM_FI_DEV_GPU_UTIL/&LabelSelector{MatchLabels:map[string]string{scaledObjectName: prometheus-scaled-object,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for prometheus-http---XX-X-XXX-XXX-9090--dcgm_fi_dev_gpu_util