The Kubernetes kube-controller-manager and kube-scheduler keep restarting. Below are the pod logs.

~$ kubectl logs -n kube-system kube-scheduler-node1 -p
I1228 16:59:26.709076 1 serving.go:319] Generated self-signed cert in-memory
I1228 16:59:27.072726 1 server.go:143] Version: v1.16.0
I1228 16:59:27.072806 1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1228 16:59:27.075087 1 authorization.go:47] Authorization is disabled
W1228 16:59:27.075103 1 authentication.go:79] Authentication is disabled
I1228 16:59:27.075117 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I1228 16:59:27.075623 1 secure_serving.go:123] Serving securely on [::]:10259
I1228 16:59:28.077293 1 leaderelection.go:241] attempting to acquire leader lease kube-system/kube-scheduler...
E1228 16:59:45.353862 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get https://IPaddress/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I1228 16:59:47.969930 1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler
I1228 17:00:42.008006 1 leaderelection.go:287] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:42.008059 1 server.go:264] leaderelection lost
~$ kubectl logs -n kube-system kube-controller-manager-node1 -p
W1228 17:00:04.721378 1 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="node4" does not exist
I1228 17:00:04.726825 1 shared_informer.go:204] Caches are synced for certificate
I1228 17:00:04.732538 1 shared_informer.go:204] Caches are synced for TTL
I1228 17:00:04.739613 1 shared_informer.go:204] Caches are synced for ClusterRoleAggregator
I1228 17:00:04.754683 1 shared_informer.go:204] Caches are synced for certificate
I1228 17:00:04.760101 1 shared_informer.go:204] Caches are synced for stateful set
I1228 17:00:04.768974 1 shared_informer.go:204] Caches are synced for namespace
I1228 17:00:04.769914 1 shared_informer.go:204] Caches are synced for deployment
I1228 17:00:04.790541 1 shared_informer.go:204] Caches are synced for daemon sets
I1228 17:00:04.790710 1 shared_informer.go:204] Caches are synced for ReplicationController
I1228 17:00:04.796386 1 shared_informer.go:204] Caches are synced for disruption
I1228 17:00:04.796403 1 disruption.go:341] Sending events to api server.
I1228 17:00:04.804131 1 shared_informer.go:204] Caches are synced for ReplicaSet
I1228 17:00:04.806910 1 shared_informer.go:204] Caches are synced for GC
I1228 17:00:04.809821 1 shared_informer.go:204] Caches are synced for taint
I1228 17:00:04.809909 1 node_lifecycle_controller.go:1208] Initializing eviction metric for zone:
W1228 17:00:04.809999 1 node_lifecycle_controller.go:903] Missing timestamp for Node node3. Assuming now as a timestamp.
W1228 17:00:04.810038 1 node_lifecycle_controller.go:903] Missing timestamp for Node node4. Assuming now as a timestamp.
W1228 17:00:04.810065 1 node_lifecycle_controller.go:903] Missing timestamp for Node node1. Assuming now as a timestamp.
W1228 17:00:04.810086 1 node_lifecycle_controller.go:903] Missing timestamp for Node node2. Assuming now as a timestamp.
I1228 17:00:04.810101 1 node_lifecycle_controller.go:1108] Controller detected that zone is now in state Normal.
I1228 17:00:04.810145 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node2", UID:"68d34fcf-fd86-42a5-9833-57108c93baee", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node2 event: Registered Node node2 in Controller
I1228 17:00:04.810164 1 taint_manager.go:186] Starting NoExecuteTaintManager
I1228 17:00:04.810224 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node3", UID:"dc80b75f-ce55-4247-84e3-bf0474ac1057", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node3 event: Registered Node node3 in Controller
I1228 17:00:04.810233 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node4", UID:"c9d859df-795e-4b2a-9def-08efc67ba4e3", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node4 event: Registered Node node4 in Controller
I1228 17:00:04.810242 1 event.go:255] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"node1", UID:"8bfe45c3-2ce7-4013-a11f-c1ac052e9e00", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node node1 event: Registered Node node1 in Controller
I1228 17:00:04.811241 1 shared_informer.go:204] Caches are synced for node
I1228 17:00:04.811367 1 range_allocator.go:172] Starting range CIDR allocator
I1228 17:00:04.811381 1 shared_informer.go:197] Waiting for caches to sync for cidrallocator
I1228 17:00:04.859423 1 shared_informer.go:204] Caches are synced for HPA
I1228 17:00:04.911545 1 shared_informer.go:204] Caches are synced for cidrallocator
I1228 17:00:04.997853 1 shared_informer.go:204] Caches are synced for bootstrap_signer
I1228 17:00:05.023218 1 shared_informer.go:204] Caches are synced for expand
I1228 17:00:05.030277 1 shared_informer.go:204] Caches are synced for PV protection
I1228 17:00:05.059763 1 shared_informer.go:204] Caches are synced for endpoint
I1228 17:00:05.060705 1 shared_informer.go:204] Caches are synced for persistent volume
I1228 17:00:05.118184 1 shared_informer.go:204] Caches are synced for attach detach
I1228 17:00:05.246897 1 shared_informer.go:204] Caches are synced for job
I1228 17:00:05.248850 1 shared_informer.go:204] Caches are synced for resource quota
I1228 17:00:05.257547 1 shared_informer.go:204] Caches are synced for garbage collector
I1228 17:00:05.257566 1 garbagecollector.go:139] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I1228 17:00:05.260287 1 shared_informer.go:204] Caches are synced for resource quota
I1228 17:00:05.305093 1 shared_informer.go:204] Caches are synced for garbage collector
I1228 17:00:44.906594 1 leaderelection.go:287] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F1228 17:00:44.906687 1 controllermanager.go:279] leaderelection lost
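
To see how long a component held the lease before losing it, the timestamps in the log headers above can be compared. The following is only a rough sketch: the helper names are made up for illustration, the year is an assumption (the log header carries only month and day), and the source positions leaderelection.go:251 (lease acquired) and leaderelection.go:287 (renewal failed) are taken from this v1.16.0 log and will likely differ in other versions.

```python
import re
from datetime import datetime

# Log header as it appears above: severity letter, MMDD, HH:MM:SS.micros,
# a thread id, then "file:line]". The space after the thread id is optional.
KLOG_HEADER = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s*(\S+?):(\d+)\]"
)

# Assumption: the header has no year, so one is supplied only to make the
# date parseable. It does not affect differences within the same year.
ASSUMED_YEAR = "2000"


def parse_header(line):
    """Return (timestamp, source_file, source_line), or None if not a log line."""
    m = KLOG_HEADER.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(
        ASSUMED_YEAR + m.group(1) + " " + m.group(2), "%Y%m%d %H:%M:%S.%f"
    )
    return stamp, m.group(3), int(m.group(4))


def seconds_lease_held(lines,
                       acquired=("leaderelection.go", 251),
                       lost=("leaderelection.go", 287)):
    """Seconds between the 'acquired' and 'failed to renew' lines, or None."""
    acquired_at = None
    for line in lines:
        parsed = parse_header(line)
        if parsed is None:
            continue
        stamp, source_file, source_line = parsed
        if (source_file, source_line) == acquired:
            acquired_at = stamp
        elif (source_file, source_line) == lost and acquired_at is not None:
            return (stamp - acquired_at).total_seconds()
    return None


# Headers of the two relevant kube-scheduler lines (messages elided).
scheduler_log = [
    "I1228 16:59:47.969930 1 leaderelection.go:251] ...",
    "I1228 17:00:42.008006 1 leaderelection.go:287] ...",
]
print(seconds_lease_held(scheduler_log))  # 54.038076
```

For the kube-scheduler log this gives roughly 54 seconds between acquiring the lease and failing to renew it. The kube-controller-manager excerpt does not include the line where the lease was acquired, so the helper would return None for it.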
1 Answer

The problem was resolved after increasing the CPU and memory of the nodes.

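This problem occurs when you run into resource pressure or network issues. In my case, the leader election API calls were timing out because the Kube API Server was short on resources, which increased the latency of API calls.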
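
Why API latency translates into a lost lease can be illustrated with a deliberately simplified model. This is not the client-go implementation, only a sketch of the idea: the leader keeps retrying the renewal, and if no attempt succeeds before the renew deadline it gives up leadership and the process exits. The function name is hypothetical; the default values mirror the documented defaults of --leader-elect-renew-deadline (10s) and --leader-elect-retry-period (2s), and the cluster in question may use different settings.

```python
def lease_renewed(attempts, renew_deadline=10.0, retry_period=2.0):
    """Simplified model of one renewal round.

    attempts: sequence of (latency_seconds, succeeded) pairs, one per API call.
    Returns True if some attempt succeeds before renew_deadline has elapsed.
    """
    elapsed = 0.0
    for latency, succeeded in attempts:
        elapsed += latency              # time spent waiting on the API server
        if elapsed > renew_deadline:
            return False                # deadline passed while waiting
        if succeeded:
            return True                 # lease renewed in time
        elapsed += retry_period         # back off before the next attempt
    return False                        # ran out of attempts without success


# Healthy API server: the first call returns quickly.
print(lease_renewed([(0.05, True)]))                 # True

# One failed call, then a fast success: still within the deadline.
print(lease_renewed([(3.0, False), (0.1, True)]))    # True

# A call that hangs until a 10 s client timeout leaves no time to retry.
print(lease_renewed([(10.0, False), (0.05, True)]))  # False
```

Under this model, a single request that hangs until the 10 s client timeout seen in the scheduler log is enough to use up the whole renew deadline, which is consistent with the "failed to renew lease" lines in the question.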

Logs from the K8S API Server:

apiserver was unable to write a JSON response: http: Handler timeout
apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
apiserver was unable to write a fallback JSON response: http: Handler timeout
answered 2021-06-24T04:02:11.963