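We run Prometheus on k8s, but it no longer starts because it does not have enough RAM (and CPU is also close to its limit). Since this is all new to me, I'm not sure which approach to take. I tried redeploying the container with a slightly higher RAM limit (the node has 16Gi, and I raised the limit from 145xxMi to 15Gi). The pod stays in the Pending state, with these events: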
Normal NotTriggerScaleUp 81s (x16 over 5m2s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) didn't match node selector, 2 Insufficient memory
Warning FailedScheduling 80s (x6 over 5m23s) default-scheduler 0/10 nodes are available: 10 Insufficient memory, 6 node(s) didn't match node selector, 9 Insufficient cpu.
Normal NotTriggerScaleUp 10s (x14 over 5m12s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 3 node(s) didn't match node selector
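To make the FailedScheduling event above easier to read, here is a minimal Python sketch that splits the scheduler message into per-reason node counts. The function name `parse_scheduling_reasons` is hypothetical and the parsing assumes the message keeps the "N reason, N reason." layout shown in the event; it is not part of any Kubernetes tooling.

```python
import re

def parse_scheduling_reasons(message):
    """Turn a scheduler message into {reason: node_count}.

    Assumes the layout seen in the event above:
    '0/10 nodes are available: <count> <reason>, <count> <reason>.'
    """
    # Everything after the first ': ' is the comma-separated reason list.
    _, _, tail = message.partition(": ")
    reasons = {}
    for part in tail.rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", part.strip())
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons

msg = ("0/10 nodes are available: 10 Insufficient memory, "
       "6 node(s) didn't match node selector, 9 Insufficient cpu.")
reasons = parse_scheduling_reasons(msg)
for reason, count in reasons.items():
    print(f"{count:>2} node(s): {reason}")
```

Read this way, the event says that all 10 nodes lack enough free memory for the request, 9 also lack CPU, and 6 are excluded by the node selector regardless of resources.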
These are the logs from when Prometheus crashed and stopped starting. describe pod also reported memory usage at 99%:
level=info ts=2020-10-09T09:39:34.745Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53476 maxSegment=53650
level=info ts=2020-10-09T09:39:38.518Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53477 maxSegment=53650
level=info ts=2020-10-09T09:39:41.244Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53478 maxSegment=53650
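The three log lines above show Prometheus replaying its WAL at startup, one segment at a time. As a rough sketch (assuming the replay rate stays about the same, which three lines cannot confirm), the Python below estimates how many segments remain and how long the replay might take. The helper name `parse_wal_line` is hypothetical.

```python
import re
from datetime import datetime

# Matches the ts=..., segment=... and maxSegment=... fields of the lines above.
LINE_RE = re.compile(r"ts=(\S+) .*segment=(\d+) maxSegment=(\d+)")

def parse_wal_line(line):
    """Return (timestamp, segment, max_segment) for a 'WAL segment loaded' line."""
    match = LINE_RE.search(line)
    ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    return ts, int(match.group(2)), int(match.group(3))

log_lines = [
    'level=info ts=2020-10-09T09:39:34.745Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53476 maxSegment=53650',
    'level=info ts=2020-10-09T09:39:38.518Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53477 maxSegment=53650',
    'level=info ts=2020-10-09T09:39:41.244Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=53478 maxSegment=53650',
]

first_ts, first_seg, _ = parse_wal_line(log_lines[0])
last_ts, last_seg, max_seg = parse_wal_line(log_lines[-1])

# Average time per segment over the observed window.
seconds_per_segment = (last_ts - first_ts).total_seconds() / (last_seg - first_seg)
remaining = max_seg - last_seg
eta_seconds = remaining * seconds_per_segment

print(f"{remaining} segments left, roughly {eta_seconds / 60:.1f} minutes at the observed rate")
```

This only extrapolates from the pasted lines; it says nothing about whether the pod has enough memory to finish the replay.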
What can I do to fix this? Note that there is no autoscaling in place.
Should I manually scale up the EC2 worker nodes, or should I do something else?