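I'm running an unsecured test Kubernetes v1.7.5 cluster on bare metal running CoreOS 1409.7.0. I've installed the api-server, controller manager, scheduler, proxy, and kubelet on the master node, and the kubelet and proxy on 3 other worker nodes, along with flanneld, using the systemd service files provided in the contrib/init k8s project.
Everything runs fine when the cluster first starts. I can deploy the dashboard and some deployments of my own (consul clients/servers, nginx, etc.), and they all work well. However, if I leave the cluster running for a few hours, I come back to find every pod in CrashLoopBackOff, each having been restarted many times. The only thing that fixes the problem is restarting the kubelet on every machine. The problem then goes away immediately and everything returns to normal.
Logs from the kubelet after it enters the bad state: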
Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: , failed to "StartContainer" for "nginx-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:06 k8-app-2.example.com kubelet[1025]: ]
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286367 1025 kuberuntime_manager.go:457] Container {Name:nginx-server Image:nginx Command:[] Args:[] WorkingDir: Ports:[{Name:http HostPort:0 ContainerPort:80 Protocol:TCP HostIP:}] EnvFrom:[] Env:[{Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/,Port:80,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:10,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.286795 1025 kuberuntime_manager.go:457] Container {Name:regup Image:registry.hub.docker.com/spunon/regup:latest Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[{Name:SERVICE_NAME Value:nginx ValueFrom:nil} {Name:SERVICE_PORT Value:80 ValueFrom:nil} {Name:NODE_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {Name:POD_IP Value: ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:Always SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287071 1025 kuberuntime_manager.go:741] checking backoff for container "nginx-server" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287376 1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=nginx-server pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287601 1025 kuberuntime_manager.go:741] checking backoff for container "regup" in pod "nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)"
Sep 10 19:09:07 k8-app-2.example.com kubelet[1025]: I0910 19:09:07.287863 1025 kuberuntime_manager.go:751] Back-off 5m0s restarting failed container=regup pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)
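For anyone trying to reproduce this, the back-off lines above are easier to read once they are grouped by pod. The following is only a rough sketch, assuming the kubelet log format shown above; the function name summarize_backoffs and the regular expression are hypothetical helpers of my own, not part of any Kubernetes tooling.

```python
import re
from collections import defaultdict

# Assumed format, taken from the kubelet lines above:
#   Back-off 5m0s restarting failed container=<name> pod=<pod>_<namespace>(<uid>)
BACKOFF_RE = re.compile(
    r"Back-off (?P<delay>\S+) restarting failed "
    r"container=(?P<container>\S+) pod=(?P<pod>[^_\s]+)_(?P<namespace>[^(\s]+)\("
)


def summarize_backoffs(lines):
    """Map (namespace, pod) to {container: most recent back-off delay}."""
    summary = defaultdict(dict)
    for line in lines:
        match = BACKOFF_RE.search(line)
        if match:  # lines without a back-off message are skipped
            key = (match["namespace"], match["pod"])
            summary[key][match["container"]] = match["delay"]
    return dict(summary)


# Two of the messages from the log above, with the journal prefix stripped.
sample = [
    "Back-off 5m0s restarting failed container=nginx-server "
    "pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)",
    "Back-off 5m0s restarting failed container=regup "
    "pod=nginx-deployment-617048525-mgf0v_default(f6dff9f2-95db-11e7-b533-02c75fb65df0)",
]

for (namespace, pod), containers in summarize_backoffs(sample).items():
    print(namespace, pod, containers)
```

The same function should work on full journal lines, since the pattern is searched for anywhere in the line rather than anchored at the start.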