amazon-web-services - kubectl commands time out with no details

Question

I have been running a Kubernetes cluster for several months. Today, when I went to deploy some updates, I started getting timeouts from the server.

Running $ kubectl get nodes yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)

Running $ kubectl get pods --all-namespaces yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get pods)

Running $ kubectl get deployments yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get deployments.extensions)

Running $ kubectl get svc yields

Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get services)

Running $ kubectl cluster-info yields (note that nothing is listed after the master)

Kubernetes master is running at https://cluster.mysite.com

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

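Troubleshooting is impossible when every command gives me these timeouts.

How can I get access to my server again from here? I am using kube-aws and an AWS CloudFormation VPC.

Thank you for your time.

Edit:

As requested, I ran $ kubectl get pods -v 7 and, after a bunch of cache returns, got this: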

I0103 16:51:32.196859 25644 round_trippers.go:414] GET cluster.mysite.com/api/v1/nodes
I0103 16:51:32.196888 25644 round_trippers.go:421] Request Headers: 
I0103 16:51:32.196894 25644 round_trippers.go:424]     Accept: application/json
I0103 16:51:32.196899 25644 round_trippers.go:424]     User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841 25644 round_trippers.go:439]     Response Status: 504 Gateway Timeout in 60044 milliseconds

I also ran $ kubectl cluster-info dump -v 7 and got:

I0103 16:51:32.196888   25644 round_trippers.go:421] Request Headers:
I0103 16:51:32.196894   25644 round_trippers.go:424]     Accept: application/json
I0103 16:51:32.196899   25644 round_trippers.go:424]     User-Agent: kubectl/v1.8.3 (darwin/amd64) kubernetes/f0efb3c
I0103 16:52:32.239841   25644 round_trippers.go:439] Response Status: 504 Gateway Timeout in 60044 milliseconds
I0103 16:52:32.242362   25644 helpers.go:207] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)",
  "reason": "Timeout",
  "details": {
    "kind": "nodes",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"The list operation against nodes could not be completed at this time, please try again.\",\"reason\":\"ServerTimeout\",\"details\":{\"name\":\"list\",\"kind\":\"nodes\"},\"code\":500}"
      }
    ]
  },
  "code": 504
}]

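Edit 2: OK, now I am getting Unable to connect to the server: EOF on every request, and I am starting to get scared. This is a production cluster, and I cannot even access it to try to troubleshoot. Does anyone have tips on how to proceed?

Edit 3: I have realized that the etcd cluster is not working properly: 2 of the 3 nodes are out of sync. Restarting one node made it rejoin the cluster correctly, but the second node fails to start its services. The services that do not start are: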

etcdadm-check.service
etcdadm-save.service
etcdadm-update-status.service
user@0.service

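The first three all give the error etcdadm-check.service: Control process exited, code=exited status=3, and the last one gives user@0.service: Start request repeated too quickly.

Any tips on how to deal with this?

Also, after restoring the second etcd node, I get Unable to connect to the server: x509: certificate signed by unknown authority when running any kubectl command. Does this mean data was lost? My certificates are still valid for more than half a year, and I have not changed anything.

Edit 4: I still have the etcd problem, but I am now following the instructions in camil's answer and will update with the results. However, I did resolve the invalid certificate problem, simply by re-running $ kube-aws render credentials with the correct path to my intermediate root CA.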

score 3 · Accepted Answer

To avoid waiting on the long timeout, you can pass the flag --request-timeout='1s'. Commands then fail fast, which allows further debugging.
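
As a rough sketch of how that flag can be used while debugging, the function below runs a few read-only queries with a short client-side timeout and reports which ones fail. The name probe_api and the choice of resources are assumptions for illustration, not part of kubectl; only the --request-timeout flag comes from the answer above.

```shell
# Hypothetical helper (not part of kubectl): probe a few read-only queries
# with a short client-side timeout so a hung API server fails fast.
probe_api() (
  timeout="${1:-5s}"   # value handed to kubectl's --request-timeout flag
  rc=0
  for resource in nodes pods services; do
    if kubectl get "$resource" --request-timeout="$timeout" >/dev/null 2>&1; then
      echo "ok   $resource"
    else
      echo "FAIL $resource"
      rc=1
    fi
  done
  exit "$rc"           # the body is a subshell, so exit only leaves the function
)
```

Calling probe_api 2s prints one line per resource and returns non-zero if any query failed or timed out.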

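I see you are running kube-aws, so it is safe to terminate the master instances (at least one of them, if you run multiple masters). The ASG will replace them automatically. You can do the same with the ETCD nodes.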
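
Before terminating anything, it may be worth confirming that the instance really is managed by an Auto Scaling group. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function names and the instance ID in the usage note are hypothetical, and it assumes the describe call prints nothing for an unmanaged instance.

```shell
# Hypothetical helper: print the Auto Scaling group an EC2 instance belongs to.
# Assumed to print nothing if the instance is not managed by any ASG.
asg_of_instance() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[].AutoScalingGroupName' \
    --output text
}

# Hypothetical helper: terminate an instance only if an ASG will replace it.
terminate_if_in_asg() (
  instance_id="$1"
  group="$(asg_of_instance "$instance_id")"
  if [ -z "$group" ]; then
    echo "refusing: $instance_id is not in an Auto Scaling group" >&2
    exit 1
  fi
  echo "terminating $instance_id (ASG: $group)"
  aws ec2 terminate-instances --instance-ids "$instance_id"
)
```

For example, terminate_if_in_asg i-0123456789abcdef0 would refuse to act on an instance that no ASG is tracking.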

If the problem persists, you will have to ssh into the master and check the logs and services by running:

journalctl -xe
systemctl status -l kubelet.service
systemctl status -l flanneld.service
systemctl status -l docker.service
rkt list

You can also use this kubectl function to debug from inside the master:

kubectl() {
/usr/bin/docker run --rm --net=host \
  -v /etc/resolv.conf:/etc/resolv.conf \
  -v /srv/kube-aws/plugins:/srv/kube-aws/plugins \
  quay.io/coreos/hyperkube:v1.9.0_coreos.0 /hyperkube kubectl "$@"
}

Then try the following commands:

kubectl get componentstatus
kubectl cluster-info
kubectl get pods -n kube-system
kubectl get events -n kube-system

Check the connectivity from the master to ETCD:

export $(cat /etc/etcd-environment | tr -d "'")

/usr/bin/etcdctl \
--ca-file=/etc/kubernetes/ssl/etcd-trusted-ca.pem \
--cert-file=/etc/kubernetes/ssl/etcd-client.pem \
--key-file=/etc/kubernetes/ssl/etcd-client-key.pem \
--endpoints="${ETCD_ENDPOINTS}" \
cluster-health
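
If cluster-health itself hangs, probing each member separately can show which one is failing. This is only a sketch: it assumes ETCD_ENDPOINTS is a comma-separated list of https:// URLs, that curl is available on the master, and that each member serves etcd's /health endpoint. It reuses the certificate paths from the etcdctl command above, and check_etcd_endpoints is a hypothetical name.

```shell
# Hypothetical helper: request /health from every etcd endpoint in a
# comma-separated list and report the ones that do not answer successfully.
check_etcd_endpoints() (
  ssl_dir="${2:-/etc/kubernetes/ssl}"   # certificate directory (assumed default)
  rc=0
  for endpoint in $(printf '%s' "$1" | tr ',' ' '); do
    if curl -sf --max-time 5 \
        --cacert "$ssl_dir/etcd-trusted-ca.pem" \
        --cert "$ssl_dir/etcd-client.pem" \
        --key "$ssl_dir/etcd-client-key.pem" \
        "$endpoint/health" >/dev/null; then
      echo "ok   $endpoint"
    else
      echo "FAIL $endpoint"
      rc=1
    fi
  done
  exit "$rc"
)
```

After the export line above, it could be called as check_etcd_endpoints "${ETCD_ENDPOINTS}".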

score 1 · Accepted Answer

rm -r ~/.kube/cache/discovery worked for me.

My timeout message looked different from yours, though:

E0528 20:32:29.191243    1730 request.go:975] Unexpected error when reading response body: net/http: request canceled (Client.Timeout exceeded while reading body)
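
A slightly more cautious variant of the same fix is to move the discovery cache aside instead of deleting it, so it can be restored if that changes nothing. This is a sketch under the assumption that the cache lives at the default path used above; reset_discovery_cache is a hypothetical name.

```shell
# Hypothetical helper: move kubectl's discovery cache aside rather than
# deleting it. kubectl rebuilds the cache on its next invocation.
reset_discovery_cache() (
  cache_dir="${1:-$HOME/.kube/cache/discovery}"
  if [ ! -d "$cache_dir" ]; then
    echo "no discovery cache at $cache_dir"
    exit 0
  fi
  backup="$cache_dir.bak.$(date +%s)"   # timestamped backup directory
  mv "$cache_dir" "$backup" && echo "moved $cache_dir to $backup"
)
```

Running reset_discovery_cache with no arguments acts on the default path; passing a directory overrides it.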
