我们有一个在 GKE 上运行的 2 节点集群,很长一段时间以来,由于 kubelet 服务随机重启,我们一直在遭受随机停机。
这些是当时运行 kubernetes 系统 pod 的节点中最后一次停机的日志:
sudo journalctl -r -u kubelet
Aug 22 04:17:36 gke-app-node1 systemd[1]: Started Kubernetes kubelet.
Aug 22 04:17:36 gke-app-node1 systemd[1]: Stopped Kubernetes kubelet.
Aug 22 04:17:36 gke-app-node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Aug 22 04:10:46 gke-app-node1 systemd[1]: Started Kubernetes kubelet.
Aug 22 04:10:46 gke-app-node1 systemd[1]: Stopped Kubernetes kubelet.
Aug 22 04:10:46 gke-app-node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Aug 22 04:09:48 gke-app-node1 systemd[1]: Started Kubernetes kubelet.
Aug 22 04:09:46 gke-app-node1 systemd[1]: Stopped Kubernetes kubelet.
Aug 22 04:09:44 gke-app-node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Aug 22 04:02:05 gke-app-node1 systemd[1]: Started Kubernetes kubelet.
Aug 22 04:02:03 gke-app-node1 systemd[1]: Stopped Kubernetes kubelet.
Aug 22 04:02:03 gke-app-node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Aug 22 04:01:09 gke-app-node1 systemd[1]: Started Kubernetes kubelet.
Aug 22 04:01:09 gke-app-node1 systemd[1]: Stopped Kubernetes kubelet.
Aug 22 04:01:08 gke-app-node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Aug 22 04:00:58 gke-app-node1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Aug 22 04:00:58 gke-app-node1 systemd[1]: kubelet.service: Unit entered failed state.
Aug 22 04:00:58 gke-app-node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 04:00:58 gke-app-node1 kubelet[1330]: I0822 04:00:58.104306 1330 server.go:794] GET /healthz: (5.286840082s) 200 [[curl/7.51.0] 127.0.0.1:35924]
Aug 22 04:00:58 gke-app-node1 kubelet[1330]: I0822 04:00:57.981923 1330 docker_server.go:73] Stop docker server
Aug 22 04:00:58 gke-app-node1 kubelet[1330]: I0822 04:00:53.991354 1330 server.go:794] GET /healthz: (5.834296581s) 200 [[Go-http-client/1.1] 127.0.0.1:35926]
Aug 22 04:00:57 gke-app-node1 kubelet[1330]: I0822 04:00:42.636036 1330 fsHandler.go:131] du and find on following dirs took 16.466105259s: [/var/lib/docker/overlay/e496194dfcb8a053050a0eb73965f57b109fe3036c1ffc5b0f12b4a341f13794 /var/lib/docker/containers/b6a212aedf588a1f1d173fd9f4871f678d014e260e8aa6147ad8212619675802]
Aug 22 04:00:39 gke-app-node1 kubelet[1330]: I0822 04:00:38.061492 1330 fsHandler.go:131] du and find on following dirs took 12.246559762s: [/var/lib/docker/overlay/303dc4c5814a0a12a6ac450e5b27327f55a7baa8000c011bd38521f3ff997e0f /var/lib/docker/containers/18a95beaf86b382bb8abc6ee40033020de1da4b54a5ca52e1c61bf7f14d6ef44]
Aug 22 04:00:39 gke-app-node1 kubelet[1330]: I0822 04:00:38.476930 1330 fsHandler.go:131] du and find on following dirs took 11.766408434s: [/var/lib/docker/overlay/86802dda255243388ab86fa8fc403187193f8c4ccdee54d8ca18c021ca35bc36 /var/lib/docker/containers/7fd4d507ec6445035fcb4a60efd4ae68e54052c1cace3268be72954062fed830]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:35.865293 1330 prober.go:106] Readiness probe for "web-deployment-812924635-ntcqw_default(bcf76fb6-8661-11e7-88da-42010a840211):rails-app" failed (failure): Get http://10.48.1.7:80/health_check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: W0822 04:00:35.380953 1330 prober.go:98] No ref for container "docker://e2dcca90c091d2789af9b22e1405cb273f63c399aecde2686ef4b1e8ab9fdc5f" (web-deployment-812924635-ntcqw_default(bcf76fb6-8661-11e7-88da-42010a840211):rails-app)
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:32.514573 1330 fsHandler.go:131] du and find on following dirs took 7.127181023s: [/var/lib/docker/overlay/647f419bce585a3d0f5792376b269704cb358828bc5c4fb5e815bfa23950d511 /var/lib/docker/containers/59f7ada601f38a243daa7154f2ed27790d14d163c4675e26186d9a6d9db0272e]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:32.255644 1330 fsHandler.go:131] du and find on following dirs took 6.72357089s: [/var/lib/docker/overlay/992a65b68531c5ac53e4cd06f7a8f8abe4b908d943b5b9cc38da126b469050b2 /var/lib/docker/containers/2be7aede380d6f3452a5abacc53f9e0a69f8c5ee3dbdf5351a30effdf2d47833]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:32.067601 1330 fsHandler.go:131] du and find on following dirs took 6.511405985s: [/var/lib/docker/overlay/7bc4e00d232b4a22eb64a87ad079970aabb24bde17d3adaa6989840ebc91b96c /var/lib/docker/containers/949c778861a4f86440c5dd21d4daf40e97fb49b9eb1498111d7941ca3e63541a]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.907928 1330 fsHandler.go:131] du and find on following dirs took 6.263993478s: [/var/lib/docker/overlay/303abc540335a9ce7077fd21182845fbff2f06ed9eb1ac8af9effdfd048153b5 /var/lib/docker/containers/6544add2796f365d67d72fe283e083042aa2af82862eb6335295d228efa28d61]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.907845 1330 fsHandler.go:131] du and find on following dirs took 7.630063774s: [/var/lib/docker/overlay/a36c376a7ddd04c168822770866d8c34499ddec7e4039ada579b3d65adc57347 /var/lib/docker/containers/6a606a6c901f8373dff22f94ba77a24956a7b4eed3d0e550be168eeaeed86236]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.902731 1330 fsHandler.go:131] du and find on following dirs took 6.259025553s: [/var/lib/docker/overlay/0a7170e1a42bfa8b0112d8c7bb805da8e4778aa5ce90978d90ed5335929633ff /var/lib/docker/containers/1f68eaa59cab0a0bcdc087e25d18573044b599967a56867d189acd82bc19172b]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.871796 1330 fsHandler.go:131] du and find on following dirs took 6.410999589s: [/var/lib/docker/overlay/25ffbf8bd71e814af8991cc52499286d2d345b3f348fec9358ca366f341ed588 /var/lib/docker/containers/efe1969587c9b0412fe7f7c8c24bbe1326d46f576bddf12f88ae7cd406b6475d]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.871699 1330 fsHandler.go:131] du and find on following dirs took 6.259940483s: [/var/lib/docker/overlay/56909c00ec20b59c1fcb4988cd51fe50ebb467681f37bab3f9061d76993565bc /var/lib/docker/containers/a8d1df672c23313313b511389f6eeb44e78c3f9e4c346d214fc190695f270e5f]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.614518 1330 fsHandler.go:131] du and find on following dirs took 5.982313751s: [/var/lib/docker/overlay/cb057acc4f3a3e91470847f78ffd550b25a24605cec42ee080aaf193933968cf /var/lib/docker/containers/e755c4d88e4e5d4d074806e829b1e83fd52c8e2b1c01c27131222a40b0c6c10a]
Aug 22 04:00:35 gke-app-node1 kubelet[1330]: I0822 04:00:31.837000 1330 fsHandler.go:131] du and find on following dirs took 7.500602734s: [/var/lib/docker/overlay/e9539d9569ccdcc79db1cd4add7036d70ad71391dc30ca16903bdd9bda4d0972 /var/lib/docker/containers/b0a7c955af1ed85f56aeaed1d787794d5ffd04c2a81820465a1e3453242c8a19]
Aug 22 04:00:34 gke-app-node1 kubelet[1330]: I0822 04:00:31.836947 1330 fsHandler.go:131] du and find on following dirs took 6.257091389s: [/var/lib/docker/overlay/200f0f063157381d25001350c34914e020ea16b3f82f7bedf7e4b01d34e513a7 /var/lib/docker/containers/eca7504b7e24332381e459a2f09acc150a5681c148cebc5867ac66021cbe0435]
Aug 22 04:00:33 gke-app-node1 kubelet[1330]: I0822 04:00:31.836787 1330 fsHandler.go:131] du and find on following dirs took 7.286756684s: [/var/lib/docker/overlay/37334712f505b11c7f0b27fb0580eadc0e79fc789dcfafbea1730efd500fb69c /var/lib/docker/containers/4858388c53032331868497859110a7267fef95110a7ab3664aa857a21ee02a3e]
Aug 22 04:00:22 gke-app-node1 kubelet[1330]: I0822 04:00:21.999930 1330 qos_container_manager_linux.go:286] [ContainerManager]: Updated QoS cgroup configuration
Aug 22 04:00:20 gke-app-node1 kubelet[1330]: I0822 04:00:19.598974 1330 server.go:794] GET /healthz: (136.991429ms) 200 [[curl/7.51.0] 127.0.0.1:35888]
Aug 22 04:00:10 gke-app-node1 kubelet[1330]: I0822 04:00:08.024328 1330 server.go:794] GET /healthz: (36.191534ms) 200 [[curl/7.51.0] 127.0.0.1:35868]
Aug 22 04:00:05 gke-app-node1 kubelet[1330]: I0822 04:00:05.861339 1330 server.go:794] GET /stats/summary/: (808.201834ms) 200 [[Go-http-client/1.1] 10.48.0.7:43022]
Aug 22 04:00:03 gke-app-node1 kubelet[1330]: W0822 04:00:02.723586 1330 raw.go:87] Error while processing event ("/sys/fs/cgroup/memory/system.slice/kube-logrotate.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/system.slice/kube-logrotate.service: no such file or directory
Aug 22 04:00:03 gke-app-node1 kubelet[1330]: W0822 04:00:02.723529 1330 raw.go:87] Error while processing event ("/sys/fs/cgroup/blkio/system.slice/kube-logrotate.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/system.slice/kube-logrotate.service: no such file or directory
Aug 22 04:00:03 gke-app-node1 kubelet[1330]: W0822 04:00:02.622765 1330 raw.go:87] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/kube-logrotate.service": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/system.slice/kube-logrotate.service: no such file or directory
sudo dmesg -T
[Tue Aug 22 04:17:31 2017] cbr0: port 4(vethd18b0189) entered disabled state
[Tue Aug 22 04:17:31 2017] device vethd18b0189 left promiscuous mode
[Tue Aug 22 04:17:31 2017] cbr0: port 4(vethd18b0189) entered disabled state
[Tue Aug 22 04:17:40 2017] cbr0: port 6(veth2985149d) entered disabled state
[Tue Aug 22 04:17:40 2017] device veth2985149d left promiscuous mode
[Tue Aug 22 04:17:40 2017] cbr0: port 6(veth2985149d) entered disabled state
[Tue Aug 22 04:17:42 2017] cbr0: port 5(veth2a1d2827) entered disabled state
[Tue Aug 22 04:17:42 2017] device veth2a1d2827 left promiscuous mode
[Tue Aug 22 04:17:42 2017] cbr0: port 5(veth2a1d2827) entered disabled state
[Tue Aug 22 04:17:42 2017] cbr0: port 2(vetha070fbca) entered disabled state
[Tue Aug 22 04:17:42 2017] device vetha070fbca left promiscuous mode
[Tue Aug 22 04:17:42 2017] cbr0: port 2(vetha070fbca) entered disabled state
[Tue Aug 22 04:17:42 2017] cbr0: port 3(veth7e3e663a) entered disabled state
[Tue Aug 22 04:17:42 2017] device veth7e3e663a left promiscuous mode
[Tue Aug 22 04:17:42 2017] cbr0: port 3(veth7e3e663a) entered disabled state
[Tue Aug 22 04:17:57 2017] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[Tue Aug 22 04:17:57 2017] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue Aug 22 04:17:57 2017] device veth215e85fc entered promiscuous mode
[Tue Aug 22 04:17:57 2017] cbr0: port 1(veth215e85fc) entered forwarding state
[Tue Aug 22 04:17:57 2017] cbr0: port 1(veth215e85fc) entered forwarding state
[Tue Aug 22 04:18:12 2017] cbr0: port 1(veth215e85fc) entered forwarding state
最后在这里我们可以看到 Kubernetes pod 是如何在那个时候被杀死的:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d7b59cd7116c gcr.io/google_containers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_worker-deployment-3042507874-ghjpd_default_1de2291b-86ef-11e7-88da-42010a840211_0
e6f2563ac3c8 5e65193af899 "/monitor --component" 4 hours ago Up 4 hours k8s_prometheus-to-sd-exporter_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_1
45d7886d0308 487e99ee05d9 "/bin/sh -c '/run.sh " 4 hours ago Up 4 hours k8s_fluentd-gcp_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_1
b5a5c5d085ac gcr.io/google_containers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_1
32dcd4d5847c 54d2a8698e3c "/bin/sh -c 'echo -99" 4 hours ago Up 4 hours k8s_kube-proxy_kube-proxy-gke-app-node1_kube-system_ed40100d42c9e285fa1f59ca7a1d8f8d_1
2d055d96b610 gcr.io/google_containers/pause-amd64:3.0 "/pause" 4 hours ago Up 4 hours k8s_POD_kube-proxy-gke-app-node1_kube-system_ed40100d42c9e285fa1f59ca7a1d8f8d_1
2be7aede380d 5e65193af899 "/monitor --component" 2 days ago Exited (0) 4 hours ago k8s_prometheus-to-sd-exporter_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_0
7fd4d507ec64 54d2a8698e3c "/bin/sh -c 'echo -99" 2 days ago Exited (0) 4 hours ago k8s_kube-proxy_kube-proxy-gke-app-node1_kube-system_ed40100d42c9e285fa1f59ca7a1d8f8d_0
cc615ec1e87c efe10ee6727f "/bin/touch /run/xtab" 2 days ago Exited (0) 2 days ago k8s_touch-lock_kube-proxy-gke-app-node1_kube-system_ed40100d42c9e285fa1f59ca7a1d8f8d_0
b0a7c955af1e 487e99ee05d9 "/bin/sh -c '/run.sh " 2 days ago Exited (0) 4 hours ago k8s_fluentd-gcp_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_0
4858388c5303 gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 days ago Exited (0) 4 hours ago k8s_POD_fluentd-gcp-v2.0-5vlrc_kube-system_6cebabe5-84f9-11e7-88da-42010a840211_0
6a606a6c901f gcr.io/google_containers/pause-amd64:3.0 "/pause" 2 days ago Exited (0) 4 hours ago k8s_POD_kube-proxy-gke-app-node1_kube-system_ed40100d42c9e285fa1f59ca7a1d8f8d_0
我们的集群正在为主池和节点池运行 kubernetes 1.7.3,它位于 GKE(europe-west1-d 区域)上。
任何帮助将不胜感激,因为我们真的不知道如何进一步调试这个问题。