I have a Kubernetes 1.16.2 cluster. When I deployed all of the services with 1 replica each, everything worked fine. Then I scaled every service up to 2 replicas and checked again: some services are running normally, but others are stuck in Pending.
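For reference, scaling a service to 2 replicas looks roughly like this (the deployment name is one example from my runsdata namespace; the exact command I used may have differed slightly):

kubectl -n runsdata scale deployment society-resident-service-v3-0 --replicas=2

When I kubectl describe one of the pending pods, I get the following output: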
[root@runsdata-bj-01 society-training-service-v1-0]# kcd society-resident-service-v3-0-788446c49b-rzjsx
Name:           society-resident-service-v3-0-788446c49b-rzjsx
Namespace:      runsdata
Priority:       0
Node:           <none>
Labels:         app=society-resident-service-v3-0
                pod-template-hash=788446c49b
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/society-resident-service-v3-0-788446c49b
Containers:
  society-resident-service-v3-0:
    Image:      docker.ssiid.com/society-resident-service:3.0.33
    Port:       8231/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:     200m
      memory:  2Gi
    Liveness:   http-get http://:8231/actuator/health delay=600s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8231/actuator/health delay=30s timeout=5s period=10s #success=1 #failure=3
    Environment:
      spring_profiles_active:  production
      TZ:                      Asia/Hong_Kong
      JAVA_OPTS:               -Djgroups.use.jdk_logger=true -Xmx4000M -Xms4000M -Xmn600M -XX:PermSize=500M -XX:MaxPermSize=500M -Xss384K -XX:+DisableExplicitGC -XX:SurvivorRatio=1 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:LargePageSizeInBytes=128M -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -Xloggc:log/gc.log
    Mounts:
      /data/storage from nfs-data-storage (rw)
      /opt/security from security (rw)
      /var/log/runsdata from log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from application-token-vgcvb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  log:
    Type:          HostPath (bare host directory volume)
    Path:          /log/runsdata
    HostPathType:
  security:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-security-claim
    ReadOnly:   false
  nfs-data-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-storage-claim
    ReadOnly:   false
  application-token-vgcvb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  application-token-vgcvb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/4 nodes are available: 4 Insufficient memory.
As you can see below, each of my machines still has more than 2 GB of memory available.
[root@runsdata-bj-01 society-training-service-v1-0]# kcp |grep Pending
society-insurance-foundation-service-v2-0-7697b9bd5b-7btq6 0/1 Pending 0 60m
society-notice-service-v1-0-548b8d5946-c5gzm 0/1 Pending 0 60m
society-online-business-service-v2-1-7f897f564-phqjs 0/1 Pending 0 60m
society-operation-gateway-7cf86b77bd-lmswm 0/1 Pending 0 60m
society-operation-user-service-v1-1-755dcff964-dr9mj 0/1 Pending 0 60m
society-resident-service-v3-0-788446c49b-rzjsx 0/1 Pending 0 60m
society-training-service-v1-0-774f8c5d98-tl7vq 0/1 Pending 0 60m
society-user-service-v3-0-74865dd9d7-t9fwz 0/1 Pending 0 60m
traefik-ingress-controller-8688cccf79-5gkjg 0/1 Pending 0 60m
[root@runsdata-bj-01 society-training-service-v1-0]# kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
192.168.0.94 384m 9% 11482Mi 73%
192.168.0.95 399m 9% 11833Mi 76%
192.168.0.96 399m 9% 11023Mi 71%
192.168.0.97 457m 11% 10782Mi 69%
[root@runsdata-bj-01 society-training-service-v1-0]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
192.168.0.94 Ready <none> 8d v1.16.2
192.168.0.95 Ready <none> 8d v1.16.2
192.168.0.96 Ready <none> 8d v1.16.2
192.168.0.97 Ready <none> 8d v1.16.2
[root@runsdata-bj-01 society-training-service-v1-0]#
Here is the describe output for all 4 nodes (only the Allocated resources section is shown):
[root@runsdata-bj-01 frontend]#kubectl describe node 192.168.0.94
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1930m (48%) 7600m (190%)
memory 9846Mi (63%) 32901376Ki (207%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
[root@runsdata-bj-01 frontend]#kubectl describe node 192.168.0.95
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1670m (41%) 6600m (165%)
memory 7196Mi (46%) 21380Mi (137%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
[root@runsdata-bj-01 frontend]# kubectl describe node 192.168.0.96
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2610m (65%) 7 (175%)
memory 9612Mi (61%) 19960Mi (128%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
[root@runsdata-bj-01 frontend]# kubectl describe node 192.168.0.97
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2250m (56%) 508200m (12705%)
memory 10940Mi (70%) 28092672Ki (176%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
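To make it easier to compare these numbers with what the waiting pods are asking for, the memory requests of all Pending pods can be listed with something like the following command (the custom-columns expression here is my own sketch, not one of the commands shown above):

kubectl -n runsdata get pods --field-selector=status.phase=Pending \
  -o custom-columns=NAME:.metadata.name,MEM_REQUEST:.spec.containers[*].resources.requests.memory

For the pod described above this should simply print 2Gi, matching the Requests section of its container.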
And here is the memory on all 4 nodes:
[root@runsdata-bj-00 ~]# free -h
total used free shared buff/cache available
Mem: 15G 2.8G 6.7G 2.1M 5.7G 11G
Swap: 0B 0B 0B
[root@runsdata-bj-01 frontend]# free -h
total used free shared buff/cache available
Mem: 15G 7.9G 3.7G 2.4M 3.6G 6.8G
Swap: 0B 0B 0B
[root@runsdata-bj-02 ~]# free -h
total used free shared buff/cache available
Mem: 15G 5.0G 2.9G 3.9M 7.4G 9.5G
Swap: 0B 0B 0B
[root@runsdata-bj-03 ~]# free -h
total used free shared buff/cache available
Mem: 15G 6.5G 2.2G 2.3M 6.6G 8.2G
Swap: 0B 0B 0B
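Note that free -h shows memory from the operating system's point of view; Kubernetes also reports an allocatable memory figure per node, which can be listed with something like this (again only a sketch):

kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory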
Here is the kube-scheduler log:
[root@runsdata-bj-01 log]# cat messages|tail -n 5000|grep kube-scheduler
Apr 17 14:31:24 runsdata-bj-01 kube-scheduler: E0417 14:31:24.404442 12740 factory.go:585] pod is already present in the activeQ
Apr 17 14:31:25 runsdata-bj-01 kube-scheduler: E0417 14:31:25.490310 12740 factory.go:585] pod is already present in the backoffQ
Apr 17 14:31:25 runsdata-bj-01 kube-scheduler: E0417 14:31:25.873292 12740 factory.go:585] pod is already present in the backoffQ
Apr 18 21:44:18 runsdata-bj-01 etcd: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:440" took too long (100.521269ms) to execute
Apr 18 21:59:40 runsdata-bj-01 kube-scheduler: E0418 21:59:40.050852 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:03:07 runsdata-bj-01 kube-scheduler: E0418 22:03:07.069465 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:03:07 runsdata-bj-01 kube-scheduler: E0418 22:03:07.950254 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:03:08 runsdata-bj-01 kube-scheduler: E0418 22:03:08.567290 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:03:09 runsdata-bj-01 kube-scheduler: E0418 22:03:09.152812 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:03:09 runsdata-bj-01 kube-scheduler: E0418 22:03:09.344902 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:04:32 runsdata-bj-01 kube-scheduler: E0418 22:04:32.969606 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:09:51 runsdata-bj-01 kube-scheduler: E0418 22:09:51.366877 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:32:16 runsdata-bj-01 kube-scheduler: E0418 22:32:16.430976 12740 factory.go:585] pod is already present in the activeQ
Apr 18 22:32:16 runsdata-bj-01 kube-scheduler: E0418 22:32:16.441182 12740 factory.go:585] pod is already present in the activeQ
I have searched Google and Stack Overflow but could not find a solution. Can anyone help me?