I am running a Kubernetes cluster that has been up for several months. Today, when I went to deploy some updates, I started getting timeouts from the server.
Running $ kubectl get nodes yields:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
Running $ kubectl get pods --all-namespaces yields:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
Running $ kubectl get deployments yields:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)
Running $ kubectl get svc yields:
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)
Running $ kubectl cluster-info yields (note that nothing is listed after the master line):
Kubernetes master is running at https://cluster.mysite.com
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
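With every command timing out like this, I cannot troubleshoot anything.
How can I regain access to my cluster from here? I am using kube-aws and an AWS CloudFormation VPC.
Thank you for your time.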
Edit:
As requested, I ran $ kubectl get pods -v 7 and, after a number of cache-related lines, got this:
I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
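For anyone reading the log above, the useful detail is how long the request hung before the 504 came back. The following is only a rough sketch, not part of kubectl: it assumes the verbose output was saved as text and that lines follow the prefix format seen here. The names `LOG`, `LINE`, `STATUS`, and `summarize` are made up for illustration. A duration of almost exactly 60 seconds hints at a fixed timeout being hit somewhere along the path rather than a slow but responsive server, although the log alone does not say which component gave up.

```python
import re
from datetime import datetime

# Two lines copied from the log above: the request and its response.
LOG = """\
I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
"""

# Assumed line prefix: severity+MMDD, HH:MM:SS.micros, pid, file:line]
LINE = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+\] (.*)$")
STATUS = re.compile(r"Response Status: (\d{3}) (.+) in (\d+) milliseconds")


def summarize(log_text):
    """Return (status_code, reason, reported_ms, wall_clock_ms), or None."""
    request_time = None
    for raw in log_text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        message = match.group(2)
        if message.startswith("GET "):
            request_time = stamp  # remember when the request was sent
        status = STATUS.search(message)
        if status and request_time is not None:
            # Cross-check kubectl's reported duration against the timestamps.
            wall_ms = round((stamp - request_time).total_seconds() * 1000)
            return (int(status.group(1)), status.group(2),
                    int(status.group(3)), wall_ms)
    return None


print(summarize(LOG))
```

The reported duration and the difference between the two timestamps should agree to within a few milliseconds. If they do, the delay was spent waiting on the server side and not inside kubectl.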
I also ran $ kubectl cluster-info dump -v 7 and got:
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894 25644 round_trippers.go:424] Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424] User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362 25644 helpers.go:207] server response object: [{
"metadata": {},
"status": "Failure",
"message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
"reason": "Timeout",
"details": {
"kind": "nodes",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
}
]
},
"code": 504
}]
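The response object above carries a second, escaped JSON document inside causes[].message, which is easy to miss. Here is a minimal sketch for unpacking it, assuming the object has been copied out of the log as-is. The helper name `inner_causes` is hypothetical. It shows that behind the outer 504 there is an inner status with reason ServerTimeout and code 500 for the list operation on nodes.

```python
import json

# The response object copied from the kubectl output above.
RESPONSE = r'''
[{
  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
  "reason": "Timeout",
  "details": {
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]
'''


def inner_causes(response_text):
    """Yield (outer_code, inner_reason, inner_code, inner_message) per cause."""
    for status in json.loads(response_text):
        for cause in status.get("details", {}).get("causes", []):
            try:
                inner = json.loads(cause["message"])
            except (KeyError, ValueError):
                continue  # the cause message is plain text, not embedded JSON
            if not isinstance(inner, dict):
                continue
            yield (status.get("code"), inner.get("reason"),
                   inner.get("code"), inner.get("message"))


for row in inner_causes(RESPONSE):
    print(row)
```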
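Edit 2:
OK, now I am getting Unable to connect to the server: EOF on every request, and I am starting to get scared. This is a production cluster and I cannot even reach it to begin troubleshooting. Does anyone have hints on how to proceed?
Edit 3: I have realized that the etcd cluster is not working properly: 2 of the 3 nodes are out of sync. Restarting one node made it rejoin the cluster correctly, but the second node fails to start its services. The services that do not start are: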
- etcdadm-check.service
- etcdadm-save.service
- etcdadm-update-status.service
- user@0.service
The first three all give the error etcdadm-check.service: Control process exited, code=exited status=3, and the last one gives user@0.service: Start request repeated too quickly.
Any hints on how to deal with this?
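To put the etcd situation in Edit 3 in numbers: etcd only serves requests while a majority (quorum) of its members is healthy. The sketch below is just that arithmetic, under the assumption that "out of sync" meant those two members were not participating. The function names `quorum` and `has_quorum` are illustrative and not part of any etcd tooling. With 3 members the quorum is 2, so a single healthy member is not enough, which would be consistent with the API server timing out on every request.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members, healthy):
    """True if enough members are healthy to form a majority."""
    return healthy >= quorum(members)


# Walk through every possible number of healthy members in a 3-node cluster.
for healthy in range(4):
    print(f"3 members, {healthy} healthy: "
          f"quorum={quorum(3)}, available={has_quorum(3, healthy)}")
```

By the same arithmetic, bringing one of the two failed members back is enough to restore a majority, even while the third member is still down.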
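Also, after recovering the second etcd node, I get Unable to connect to the server: x509: certificate signed by unknown authority when running any kubectl command. Does this mean data was lost? My certificates are valid for more than another half year, and I have not changed anything.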
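Edit 4: I still have the etcd problem, but I am now following the instructions in camil's answer and will update with the results. However, I did resolve the invalid-certificate problem simply by re-running $ kube-aws render credentials with the correct path to my intermediate root CA.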