docker - K8s environment - curl from a Pod to an Endpoint (pointing to an external database) times out and fails

Question

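Problem statement:

In a K8s environment, curl from a Pod to an Endpoint (pointing to an external database) fails with a timeout.

K8s cluster details: a 3-node cluster hosted with Rancher Kubernetes Engine (RKE), using Docker as the container runtime.

Nodes: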

NAME       ROLES             
worker1    etcd,worker       
worker2    etcd,worker       
master     controlplane,etcd

Because this setup uses RKE, the apiserver and kubelet run as Docker containers on all nodes.

The Pod and the Endpoint are in the same namespace.

Command run inside the pod:

curl <service-ip-of-endpoint>:<service-port-of-endpoint>

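The curl times out and fails. However, if we curl the actual database IP and port from outside the pod (that is, on the node), we get the expected response.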
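For reference, the same check can be narrowed to a plain TCP connect, which separates a timeout (packets silently dropped somewhere on the path) from a refusal (the target is reachable but nothing is listening). The following is only a minimal sketch, assuming Python 3 is available in the pod image, which the setup above does not state; the function name probe_tcp and its return strings are hypothetical and chosen for illustration.

```python
import socket
import sys


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connect and classify the result.

    Returns "open", "timeout", "refused", or "error: <reason>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout: packets are likely dropped en route.
        return "timeout"
    except ConnectionRefusedError:
        # An RST came back: the host is reachable, the port is closed.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. "no route to host" or a name resolution failure.
        return f"error: {exc}"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 probe.py <ip> <port>
    print(probe_tcp(sys.argv[1], int(sys.argv[2])))
```

Running it once against the service IP and port of the endpoint and once against the database IP and port, from inside the pod and again on the node, would show whether the two targets fail in the same way.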

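We are trying to trace the route of the packets sent during the curl attempt, using the ip route and tracepath utilities.

The ip route command gives the same response every time it is run in the pod. tracepath, however, shows a different path each time, because multiple pods in the cluster share the same IP (the IP of the node on which the pod's containers were created).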

[root@master etc]# kubectl exec -it -n [namespace] [pod-name] sh
sh-4.2# ip route get [service ip of endpoint]
[service ip of endpoint] via [inaccessible software router ip] dev eth0 src [pod ip]
    cache
sh-4.2# ip route get [database ip for which we created the endpoint]
[database ip for which we created the endpoint] via [inaccessible software router ip] dev eth0 src [pod ip]
    cache

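Below are the different paths that appeared on each run of the tracepath command. Note that once the request reaches the gateway, the path to the service IP of our endpoint is always the same. Even for something generic like google.com, the path after the gateway is the same. The path shown after the gateway has therefore been removed, and only the part of the output that keeps changing is shown.

'[the-ip]' is the same IP address in all cases: the IP address of the node on which our pod (the one we run the commands from) is running.

In each case, the hop before the gateway is a different pod that Rancher runs as a DaemonSet.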

sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local   0.075ms
 1:  [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local   0.041ms
 2:  [some-gateway]                                        0.486ms asymm  3

sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local   0.065ms
 1:  [the-ip].rancher-monitoring-ingress-nginx.ingress-nginx.svc.cluster.local   0.031ms
 2:  [some-gateway]                                        0.496ms asymm  3

sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local   0.093ms
 1:  [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local   0.054ms
 2:  [some-gateway]                                        0.480ms asymm  3
 
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].rancher-monitoring-ingress-nginx.ingress-nginx.svc.cluster.local   0.082ms
 1:  [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local   0.041ms
 2:  [some-gateway]                                        0.626ms asymm  3

sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local   0.103ms
 1:  [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local   0.052ms
 2:  [some-gateway]                                      0.501ms asymm  3

sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
 1?: [LOCALHOST]                                         pmtu 1450
 1:  [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local   0.159ms
 1:  [the-ip].rancher-monitoring-kubelet.kube-system.svc.cluster.local   0.082ms
 2:  [some-gateway]                                       0.725ms asymm  3

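We also ran tcpdump on the pod's eth0 interface and determined that the requests go through the coredns pods in the kube-system namespace. However, the coredns pods emit no logs indicating incoming or outgoing requests. The same happens when curling a generic website such as google.com, except that the curl to google.com succeeds while the curl to the service IP and port of the database endpoint (our case) fails.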

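We would appreciate any pointers to the specific logs or components we should look at to identify the point in the route at which the request fails.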
