每次我初始化一个新集群时,从 3 天到一个月左右的任何时间都可以完美运行。然后 kube-dns 就会停止运行。我可以进入 kubedns 容器,它似乎运行良好,虽然我真的不知道要寻找什么。我可以 ping 一个主机名,它可以解析并且可以访问,所以 kubedns 容器本身仍然有 dns 服务。它只是不为集群中的其他容器提供它。并且失败发生在自启动之前一直在运行的两个容器中(因此它们曾经能够解析+ping主机名,但现在无法解析它,但仍然可以使用IP ping),以及创建的新容器。
我不确定这是否与时间有关,或者与已创建的作业或 pod 的数量有关。最近的事件发生在创建了 32 个 Pod 和 20 个工作岗位之后。
如果我使用以下命令删除 kube-dns pod:
kubectl delete pod --namespace kube-system kube-dns-<pod_id>
创建了一个新的 kube-dns pod,一切恢复正常(DNS 适用于所有新旧容器)。
我有一个主节点和两个工作节点。它们都是 CentOS 7 机器。
要设置集群,在主服务器上,我运行:
systemctl start etcd
etcdctl mkdir /kube-centos/network
etcdctl mk /kube-centos/network/config "{ "Network": "172.30.0.0/16", "SubnetLen": 24, "Backend": { "Type": "vxlan" } }"
systemctl disable etcd && systemctl stop etcd
systemctl enable docker && systemctl start docker
systemctl enable kubelet && systemctl start kubelet
kubeadm init --kubernetes-version v1.10.0 --apiserver-advertise-address=$(hostname) --ignore-preflight-errors=DirAvailable--var-lib-etcd
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=1.10&env.WEAVE_NO_FASTDP=1&env.CHECKPOINT_DISABLE=1"
kubectl --kubeconfig=/etc/kubernetes/admin.conf proxy -p 80 --accept-hosts='.*' --address=<master_ip> &
在这两个工人身上,我运行:
kubeadm join --token <K8S_MASTER_HOST>:6443 --discovery-token-ca-cert-hash sha256: --ignore-preflight-errors cri
这是我运行的一些 shell 命令+输出,它们可能很有用:
在故障开始发生之前,这是一个在其中一个工作人员上运行的容器:
bash-4.4# env
PACKAGES= dumb-init musl libc6-compat linux-headers build-base bash git ca-certificates python3 python3-dev
HOSTNAME=network-test
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
KUBERNETES_PORT=tcp://10.96.0.1:443
PWD=/
HOME=/root
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_PORT=443
ALPINE_VERSION=3.7
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
TERM=xterm
SHLVL=1
KUBERNETES_SERVICE_PORT=443
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
KUBERNETES_SERVICE_HOST=10.96.0.1
_=/usr/bin/env
bash-4.4# ifconfig
eth0 Link encap:Ethernet HWaddr 8A:DD:6E:E8:C4:E3
inet addr:10.44.0.1 Bcast:10.47.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:42 (42.0 B) TX bytes:42 (42.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
bash-4.4# ip route
default via 10.44.0.0 dev eth0
10.32.0.0/12 dev eth0 scope link src 10.44.0.1
bash-4.4# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local dc1.int.company.com dc2.int.company.com dc3.int.company.com
options ndots:5
-- Note that even when working, the DNS IP can't be pinged. --
bash-4.4# ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
-- Never unblocks. So we know it's fine that the container can't ping the DNS IP. --
bash-4.4# ping 10.44.0.0
PING 10.44.0.0 (10.44.0.0): 56 data bytes
64 bytes from 10.44.0.0: seq=0 ttl=64 time=0.139 ms
64 bytes from 10.44.0.0: seq=1 ttl=64 time=0.124 ms
--- 10.44.0.0 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.124/0.131/0.139 ms
bash-4.4# ping somehost.env.dc1.int.company.com
PING somehost.env.dc1.int.company.com (10.112.17.2): 56 data bytes
64 bytes from 10.112.17.2: seq=0 ttl=63 time=0.467 ms
64 bytes from 10.112.17.2: seq=1 ttl=63 time=0.271 ms
64 bytes from 10.112.17.2: seq=2 ttl=63 time=0.214 ms
64 bytes from 10.112.17.2: seq=3 ttl=63 time=0.241 ms
64 bytes from 10.112.17.2: seq=4 ttl=63 time=0.350 ms
--- somehost.env.dc1.int.company.com ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.214/0.308/0.467 ms
bash-4.4# ping 10.112.17.2
PING 10.112.17.2 (10.112.17.2): 56 data bytes
64 bytes from 10.112.17.2: seq=0 ttl=63 time=0.474 ms
64 bytes from 10.112.17.2: seq=1 ttl=63 time=0.404 ms
--- 10.112.17.2 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.404/0.439/0.474 ms
bash-4.4# ping worker1.env
PING worker1.env (10.112.5.50): 56 data bytes
64 bytes from 10.112.5.50: seq=0 ttl=64 time=0.051 ms
64 bytes from 10.112.5.50: seq=1 ttl=64 time=0.102 ms
--- worker1.env ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.051/0.076/0.102 ms
失败开始后,一直在运行的同一个容器:
bash-4.4# env
PACKAGES= dumb-init musl libc6-compat linux-headers build-base bash git ca-certificates python3 python3-dev
HOSTNAME=vda-test
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
KUBERNETES_PORT=tcp://10.96.0.1:443
PWD=/
HOME=/root
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP_PORT=443
ALPINE_VERSION=3.7
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
TERM=xterm
SHLVL=1
KUBERNETES_SERVICE_PORT=443
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
KUBERNETES_SERVICE_HOST=10.96.0.1
OLDPWD=/root
_=/usr/bin/env
bash-4.4# ifconfig
eth0 Link encap:Ethernet HWaddr 22:5E:D5:72:97:98
inet addr:10.44.0.2 Bcast:10.47.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:1645 errors:0 dropped:0 overruns:0 frame:0
TX packets:1574 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:718909 (702.0 KiB) TX bytes:150313 (146.7 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
bash-4.4# ip route
default via 10.44.0.0 dev eth0
10.32.0.0/12 dev eth0 scope link src 10.44.0.2
bash-4.4# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local dc1.int.company.com dc2.int.company.com dc3.int.company.com
options ndots:5
bash-4.4# ping 10.44.0.0
PING 10.44.0.0 (10.44.0.0): 56 data bytes
64 bytes from 10.44.0.0: seq=0 ttl=64 time=0.130 ms
64 bytes from 10.44.0.0: seq=1 ttl=64 time=0.097 ms
64 bytes from 10.44.0.0: seq=2 ttl=64 time=0.072 ms
64 bytes from 10.44.0.0: seq=3 ttl=64 time=0.102 ms
64 bytes from 10.44.0.0: seq=4 ttl=64 time=0.116 ms
64 bytes from 10.44.0.0: seq=5 ttl=64 time=0.099 ms
64 bytes from 10.44.0.0: seq=6 ttl=64 time=0.167 ms
64 bytes from 10.44.0.0: seq=7 ttl=64 time=0.086 ms
--- 10.44.0.0 ping statistics ---
8 packets transmitted, 8 packets received, 0% packet loss
round-trip min/avg/max = 0.072/0.108/0.167 ms
bash-4.4# ping somehost.env.dc1.int.company.com
ping: bad address 'somehost.env.dc1.int.company.com'
bash-4.4# ping 10.112.17.2
PING 10.112.17.2 (10.112.17.2): 56 data bytes
64 bytes from 10.112.17.2: seq=0 ttl=63 time=0.523 ms
64 bytes from 10.112.17.2: seq=1 ttl=63 time=0.319 ms
64 bytes from 10.112.17.2: seq=2 ttl=63 time=0.304 ms
--- 10.112.17.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.304/0.382/0.523 ms
bash-4.4# ping worker1.env
ping: bad address 'worker1.env'
bash-4.4# ping 10.112.5.50
PING 10.112.5.50 (10.112.5.50): 56 data bytes
64 bytes from 10.112.5.50: seq=0 ttl=64 time=0.095 ms
64 bytes from 10.112.5.50: seq=1 ttl=64 time=0.073 ms
64 bytes from 10.112.5.50: seq=2 ttl=64 time=0.083 ms
--- 10.112.5.50 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.073/0.083/0.095 ms
以下是 kube-dns 容器中的一些命令:
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 9A:24:59:D1:09:52
inet addr:10.32.0.2 Bcast:10.47.255.255 Mask:255.240.0.0
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:4387680 errors:0 dropped:0 overruns:0 frame:0
TX packets:4124267 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1047398761 (998.8 MiB) TX bytes:1038950587 (990.8 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:4352618 errors:0 dropped:0 overruns:0 frame:0
TX packets:4352618 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:359275782 (342.6 MiB) TX bytes:359275782 (342.6 MiB)
/ # ping somehost.env.dc1.int.company.com
PING somehost.env.dc1.int.company.com (10.112.17.2): 56 data bytes
64 bytes from 10.112.17.2: seq=0 ttl=63 time=0.430 ms
64 bytes from 10.112.17.2: seq=1 ttl=63 time=0.252 ms
--- somehost.env.dc1.int.company.com ping statistics ---
2 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.208/0.274/0.430 ms
/ # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53152 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58424 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53174 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58468 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58446 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53096 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58490 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53218 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53100 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53158 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53180 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58402 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53202 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53178 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:sunproxyadmin 10.32.0.1:58368 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53134 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53200 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53136 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53130 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53222 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53196 TIME_WAIT
tcp 0 0 kube-dns-86f4d74b45-2kxdr:48230 10.96.0.1:https ESTABLISHED
tcp 0 0 kube-dns-86f4d74b45-2kxdr:10054 10.32.0.1:53102 TIME_WAIT
netstat: /proc/net/tcp6: No such file or directory
netstat: /proc/net/udp6: No such file or directory
netstat: /proc/net/raw6: No such file or directory
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
主节点+工作节点上的版本/操作系统信息:
kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
uname -a
Linux master1.env.dc1.int.company.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux