
We are running into intermittent connectivity/DNS problems in a Kubernetes 1.10 cluster running on Ubuntu.

We have been going through bug reports and the like, and recently we were able to determine that a process is holding /run/xtables.lock, which is causing problems for a kube-proxy pod.

One of the kube-proxy pods, bound to the worker node, repeats this error in its logs:

E0920 13:39:42.758280       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 13:46:46.193919       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:05:45.185720       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:11:52.455183       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:38:36.213967       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
E0920 14:44:43.442933       1 proxier.go:647] Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.

These errors started roughly 3 weeks ago, and so far we have not been able to correct them. Because the problem is intermittent, we only just tracked it down to this.

We believe this is also leaving one of the kube-flannel-ds pods stuck in a permanent CrashLoopBackOff:

NAME                                 READY     STATUS             RESTARTS   AGE
coredns-78fcdf6894-6z6rs             1/1       Running            0          40d
coredns-78fcdf6894-dddqd             1/1       Running            0          40d
etcd-k8smaster1                      1/1       Running            0          40d
kube-apiserver-k8smaster1            1/1       Running            0          40d
kube-controller-manager-k8smaster1   1/1       Running            0          40d
kube-flannel-ds-amd64-sh5gc          1/1       Running            0          40d
kube-flannel-ds-amd64-szkxt          0/1       CrashLoopBackOff   7077       40d
kube-proxy-6pmhs                     1/1       Running            0          40d
kube-proxy-d7d8g                     1/1       Running            0          40d
kube-scheduler-k8smaster1            1/1       Running            0          40d

Most of the bug reports about /run/xtables.lock seem to indicate that this was resolved back in July 2017, but we are seeing it on a fresh setup. We appear to have the proper chains configured in iptables.

Running fuser /run/xtables.lock returns nothing.
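
For anyone wanting to reproduce the checks, something along these lines is what we mean: listing the chains kube-proxy complains about, and looking for anything that has the lock file open. Note that the lock is normally only held for the duration of a single iptables invocation, so an empty result from fuser/lsof at any given moment is not conclusive on its own.

iptables -t filter -L KUBE-SERVICES -n
iptables -t filter -L KUBE-EXTERNAL-SERVICES -n
fuser -v /run/xtables.lock
lsof /run/xtables.lock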

Does anyone have any insight into this? It is causing a lot of pain.


1 Answer


So, after some more digging, we were able to find the reason code with the following command:

kubectl -n kube-system describe pods kube-flannel-ds-amd64-szkxt

The pod's name will of course differ from one installation to another, but the terminated state's reason code in the output was:

   Last State:     Terminated
     Reason:       OOMKilled
     Exit Code:    137
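
If you only want that field rather than the full describe output, something like this should also work (it assumes the pod's single container sits at index 0 of containerStatuses):

kubectl -n kube-system get pod kube-flannel-ds-amd64-szkxt -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'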

We had missed this reason code earlier (we had mostly been focusing on the exit code of 137); it means out of memory; killed.

By default, kube-flannel-ds gets a maximum memory allocation of 100Mi, which is apparently too low. There are other issues filed about changing this default in the reference configuration, but our fix was to raise the maximum limit to 256Mi.

Changing the configuration is one step; just issue:

kubectl -n kube-system edit ds kube-flannel-ds-amd64

and change the limits -> memory value from 100Mi to something higher; we went with 256Mi.
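
If you would rather not go through the interactive editor, a non-interactive equivalent would be roughly the following JSON patch (it assumes the flannel container is the first entry in the container list; adjust the index if your manifest differs):

kubectl -n kube-system patch ds kube-flannel-ds-amd64 --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}]'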

By default these pods are only updated OnDelete, so you need to delete the pod that is in CrashLoopBackOff; it will then be recreated with the updated values.
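
For our crashing pod that was simply:

kubectl -n kube-system delete pod kube-flannel-ds-amd64-szkxt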

I suppose you could also roll through and delete the ones on the other nodes as well, but we only deleted the one that kept failing.
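
If you do want to roll them all, you could list the flannel pods and delete them one at a time, waiting for each replacement to come back Ready before moving on (the app=flannel label below matches the stock kube-flannel manifest; yours may differ):

kubectl -n kube-system get pods -l app=flannel -o wide
kubectl -n kube-system delete pod <flannel-pod-name>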

Here are references to a couple of the issues that helped us track this down:

https://github.com/coreos/flannel/issues/963
https://github.com/coreos/flannel/issues/1012

answered 2018-09-27T13:23:04.950