I created a three-node k8s cluster on GCP with kubeadm (1 master + 2 workers), and everything seems fine except pod-to-pod communication.

So, first of all, there is nothing obviously wrong in the cluster. All pods are Running; no errors, no CrashLoopBackOffs, no pending pods.

I set up the following scenario to test with:

$ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE   IP            NODE           
default       bb-9bd94cf6f-b5cj5                    1/1     Running   1          19h   192.168.2.3   worker-node-1  
default       curler-7668c66bf5-6c6v8               1/1     Running   1          20h   192.168.2.2   worker-node-1  
default       curler-master-5b86858f9f-c6zhq        1/1     Running   0          18h   192.168.0.6   master-node    
default       nginx-5c7588df-x42vt                  1/1     Running   0          19h   192.168.2.4   worker-node-1  
default       nginy-6d77947646-4r4rl                1/1     Running   0          20h   192.168.1.4   worker-node-2  
kube-system   calico-node-9v98k                     2/2     Running   0          97m   10.240.0.7    master-node    
kube-system   calico-node-h2px8                     2/2     Running   0          97m   10.240.0.9    worker-node-2  
kube-system   calico-node-qjn5t                     2/2     Running   0          97m   10.240.0.8    worker-node-1  
kube-system   coredns-86c58d9df4-gckhl              1/1     Running   0          97m   192.168.1.9   worker-node-2  
kube-system   coredns-86c58d9df4-wvt2n              1/1     Running   0          97m   192.168.2.6   worker-node-1  
kube-system   etcd-master-node                      1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-apiserver-master-node            1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-controller-manager-master-node   1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-proxy-2g85h                      1/1     Running   0          97m   10.240.0.8    worker-node-1  
kube-system   kube-proxy-77pq4                      1/1     Running   0          97m   10.240.0.9    worker-node-2  
kube-system   kube-proxy-bbd2d                      1/1     Running   0          97m   10.240.0.7    master-node    
kube-system   kube-scheduler-master-node            1/1     Running   0          97m   10.240.0.7    master-node  

These are the services:

$ kubectl get svc --all-namespaces
NAMESPACE     NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
default       kubernetes     ClusterIP   10.96.0.1        <none>        443/TCP         21h
default       nginx          ClusterIP   10.109.136.120   <none>        80/TCP          20h
default       nginy          NodePort    10.101.111.222   <none>        80:30066/TCP    20h
kube-system   calico-typha   ClusterIP   10.111.238.0     <none>        5473/TCP        21h
kube-system   kube-dns       ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP   21h
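
To confirm that these ClusterIPs actually map to the pods listed above, the endpoints can be checked as well (a verification step not shown in the original post):

# Each service should list the IP of its backing pod (192.168.2.4 for
# nginx, 192.168.1.4 for nginy); an empty ENDPOINTS column would point
# to a selector problem rather than a networking one.
$ kubectl get endpoints nginx nginy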

The nginx and nginy services point to the nginx-xxx and nginy-xxx pods respectively, both running nginx. The curler pods have curl and ping installed; one runs on the master node and the other on worker-node-1. If I exec into the curler pod running on worker-node-1 (curler-7668c66bf5-6c6v8) and curl the nginx pod on the same node, it works fine.

$ kubectl exec -it curler-7668c66bf5-6c6v8 sh
/ # curl 192.168.2.4 -I
HTTP/1.1 200 OK
Server: nginx/1.15.12
Date: Tue, 07 May 2019 10:59:06 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 16 Apr 2019 13:08:19 GMT
Connection: keep-alive
ETag: "5cb5d3c3-264"
Accept-Ranges: bytes
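
What the session above does not include is a direct cross-node check. Curling the nginy pod on worker-node-2 by its pod IP, from the same curler pod, would isolate cross-node pod routing from DNS entirely (a hypothetical step; the pod IP comes from the listing above):

/ # curl 192.168.1.4 -I
# nginy pod on worker-node-2. A timeout here, while the same-node curl
# succeeds, confirms that cross-node pod traffic is being dropped.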

If I try the same thing using the service name, it works only about half the time. The coredns pods run one on worker-node-1 and one on worker-node-2, and I believe that when the DNS request lands on the coredns pod on worker-node-1 it works, but when it goes to the one on worker-node-2 it does not:

/ # curl nginx -I
curl: (6) Could not resolve host: nginx

/ # curl nginx -I
HTTP/1.1 200 OK
Server: nginx/1.15.12
Date: Tue, 07 May 2019 11:06:13 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 16 Apr 2019 13:08:19 GMT
Connection: keep-alive
ETag: "5cb5d3c3-264"
Accept-Ranges: bytes
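
To pin down which coredns replica is failing, each pod can be queried directly by IP from the curler pod (a hypothetical isolation test; the coredns pod IPs come from the listing above):

/ # nslookup nginx.default.svc.cluster.local 192.168.2.6
/ # nslookup nginx.default.svc.cluster.local 192.168.1.9
# 192.168.2.6 runs on worker-node-1 (same node as the curler) and
# 192.168.1.9 on worker-node-2. If only the cross-node query times
# out, DNS itself is fine and the real fault is inter-node pod traffic.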

So, my pod-to-pod communication across nodes is definitely broken. I checked the logs of the calico daemonset pods and found nothing suspicious there, but I did find some suspicious lines in the kube-proxy logs:

$ kubectl logs kube-proxy-77pq4 -n kube-system
W0507 09:16:51.305357       1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I0507 09:16:51.315528       1 server_others.go:148] Using iptables Proxier.
I0507 09:16:51.315775       1 server_others.go:178] Tearing down inactive rules.
E0507 09:16:51.356243       1 proxier.go:563] Error removing iptables rules in ipvs proxier: error deleting chain "KUBE-MARK-MASQ": exit status 1: iptables: Too many links.
I0507 09:16:51.648112       1 server.go:464] Version: v1.13.1
I0507 09:16:51.658690       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0507 09:16:51.659034       1 config.go:102] Starting endpoints config controller
I0507 09:16:51.659052       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I0507 09:16:51.659076       1 config.go:202] Starting service config controller
I0507 09:16:51.659083       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I0507 09:16:51.759278       1 controller_utils.go:1034] Caches are synced for endpoints config controller
I0507 09:16:51.759291       1 controller_utils.go:1034] Caches are synced for service config controller

Can anyone tell me whether the problem could be caused by a kube-proxy or iptables misconfiguration, or point out anything I'm missing?
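
One way to rule kube-proxy in or out, assuming SSH access to a node, is to check whether the service rules were actually programmed into the nat table (a debugging sketch, not from the original post):

$ sudo iptables-save -t nat | grep 10.109.136.120
# If KUBE-SERVICES / KUBE-SVC-* entries for the nginx ClusterIP appear,
# kube-proxy has done its job and the fault lies below it, in the pod
# network (Calico) or the underlying VPC firewall.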

1 Answer

The issue was resolved by the original poster, with the following solution:

The problem was that I had to allow IP-in-IP traffic in the firewall rules in GCP. Now it works.
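
For reference: Calico's default encapsulation is IP-in-IP (IP protocol 4), which GCP's default firewall rules typically do not cover. A rule along these lines would permit it between the nodes (a sketch; the rule and network names are placeholders, and the source range is inferred from the node IPs 10.240.0.7-9 above):

$ gcloud compute firewall-rules create allow-ipip \
    --network my-k8s-network \
    --allow ipip \
    --source-ranges 10.240.0.0/24
# "--allow 4" (the raw IP protocol number) should work as well.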

Answered on 2019-06-25T08:21:54.590