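I have an internal 5-node cluster running on bare metal, and I am using Calico. The cluster worked for 22 days and then suddenly stopped working. After investigating, I found that service-to-pod communication is broken, even though all components are up and kubectl works without a problem.
From within the cluster (component A), if I curl another component (bridge) using its pod IP, it works: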
$ curl -vvv http://10.4.130.184:9998
* Rebuilt URL to: http://10.4.130.184:9998/
* Trying 10.4.130.184...
* TCP_NODELAY set
* Connected to 10.4.130.184 (10.4.130.184) port 9998 (#0)
> GET / HTTP/1.1
> Host: 10.4.130.184:9998
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Accept-Ranges: bytes
< Cache-Control: public, max-age=0
< Last-Modified: Mon, 08 Apr 2019 14:06:42 GMT
< ETag: W/"179-169fd45c550"
< Content-Type: text/html; charset=UTF-8
< Content-Length: 377
< Date: Wed, 23 Oct 2019 09:56:35 GMT
< Connection: keep-alive
<
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<title>Bridge</title>
<meta content='width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0' name='viewport' />
<meta name="viewport" content="width=device-width" />
</head>
<body>
<h1>Bridge</h1>
</body>
</html>
* Connection #0 to host 10.4.130.184 left intact
The nslookup for the service also works (it resolves to the service IP):
$ nslookup bridge
Server: 10.5.0.10
Address 1: 10.5.0.10 kube-dns.kube-system.svc.k8s.local
Name: bridge
Address 1: 10.5.160.50 bridge.170.svc.k8s.local
But service-to-pod communication is broken: when I curl the service name, most of the time (60-70%) it gets stuck:
$ curl -vvv http://bridge:9998
* Rebuilt URL to: http://bridge:9998/
* Could not resolve host: bridge
* Closing connection 0
curl: (6) Could not resolve host: bridge
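To put a number on an intermittent failure like this, a small loop along the following lines may help. It is only a sketch: probe_failures is a hypothetical helper name, and the defaults (20 attempts, 3-second timeout per attempt) are arbitrary assumptions.

```shell
# Probe a URL repeatedly and report how many attempts failed.
# Usage: probe_failures URL [ATTEMPTS] [TIMEOUT_SECONDS]
# (hypothetical helper; the defaults are assumptions, not recommendations)
probe_failures() {
  url=$1
  attempts=${2:-20}
  timeout=${3:-3}
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --max-time bounds each attempt, so a hung connection counts as a failure.
    # A DNS resolution error also makes curl exit non-zero.
    if ! curl --silent --output /dev/null --max-time "$timeout" "$url"; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "$failed/$attempts failed"
}
```

Running it in turn against the service name (probe_failures http://bridge:9998 50), the ClusterIP (http://10.5.160.50:9998) and the pod IP (http://10.4.130.184:9998) would help separate DNS failures from service routing failures.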
When I check the endpoints of that service, I can see that the pod's IP is there:
$ kubectl get ep -n 170 bridge
NAME     ENDPOINTS                                                AGE
bridge   10.4.130.184:9226,10.4.130.184:9998,10.4.130.184:9226   11d
But as I said, curl (or any other method) using the service name does not work. This is the service description:
$ kubectl describe svc -n 170 bridge
Name: bridge
Namespace: 170
Labels: io.kompose.service=bridge
Annotations: Process: bridge
Selector: io.kompose.service=bridge
Type: ClusterIP
IP: 10.5.160.50
Port: 9998 9998/TCP
TargetPort: 9998/TCP
Endpoints: 10.4.130.184:9998
Port: 9226 9226/TCP
TargetPort: 9226/TCP
Endpoints: 10.4.130.184:9226
Port: 9226-udp 9226/UDP
TargetPort: 9226/UDP
Endpoints: 10.4.130.184:9226
Session Affinity: None
Events: <none>
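This issue is not limited to this component; it is the same for all components.
I restarted CoreDNS (deleted its pods), but nothing changed. I have hit this problem before. The previous time I thought it was related to Weavenet, which I was using then, and since I needed the cluster I tore it down and rebuilt it with Calico. Now I am sure this is not related to the CNI and is something else.
Environment:
- Kubernetes version (use kubectl version):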
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:27:17Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: This is a bare-metal cluster of 5 nodes, 1 master and 4 workers. All nodes run Ubuntu 18.04 and are connected to the same subnet.
- OS (e.g. cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
- Kernel (e.g. uname -a):
Linux serflex-argus-1 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
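- Install tools: Kubeadm
- Network plugin and version (if this is a network-related bug): Calico "cniVersion": "0.3.1"
Update
After deleting all the kube-proxy pods, the problem seems to be solved, but I would still like to know what caused it. By the way, I did not see any errors in the kube-proxy logs.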
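Since restarting kube-proxy made the problem go away, one thing that might be worth checking if it happens again is whether each node still has NAT rules for the service's ClusterIP. The following is a rough sketch that assumes kube-proxy runs in iptables mode (it does not apply to IPVS mode). count_service_rules is a hypothetical helper, not part of any tool.

```shell
# Count rules in iptables-save output (read from stdin) whose destination
# is the given ClusterIP. Assumes kube-proxy in iptables mode, where service
# rules match on "-d <ClusterIP>/32".
# (hypothetical helper name)
count_service_rules() {
  # -F: treat the pattern as a fixed string so the dots are literal.
  # "|| true": grep exits non-zero on zero matches but still prints 0.
  grep -cF -- "-d $1/32" || true
}
```

On a node, it could be used as sudo iptables-save -t nat | count_service_rules 10.5.160.50. A count of 0 on some nodes but not others would suggest that kube-proxy on those nodes stopped syncing rules. For reference, the restart described above can be done with kubectl -n kube-system delete pod -l k8s-app=kube-proxy, assuming the default label that kubeadm puts on the kube-proxy DaemonSet.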