kubernetes - Kubernetes 1.15.5 and romana 2.0.2: network errors when any pod is added or removed

Question

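I have been encountering some mysterious network errors on our Kubernetes cluster. Although I first saw these errors when going through ingress, there are even more errors when I bypass our load balancer, kube-proxy and nginx-ingress. The most errors occur when going directly to services and directly to the pod IPs. I believe this is because the load balancer and nginx have better error handling than the raw iptables routing.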

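To reproduce the errors, I run Apache Benchmark (ab) from a VM on the same subnet, at any concurrency level, without keep-alive, connecting to the pod IP and using a request count high enough to give me time to scale a deployment up or down. Strangely, it does not matter which deployment I modify: the same set of errors always appears, even when the deployment is unrelated to the pod I am testing. ANY addition or removal of pods triggers ab errors. Manual deletion, scaling up or down, and auto-scaling all trigger them. If no pods change while the ab test is running, no errors are reported. Note that keep-alive seems to greatly reduce the errors, if not eliminate them, but I only tested that a handful of times and never saw an error.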
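The procedure described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names failed_requests and ab_during_scale are hypothetical, the 5-second warm-up is arbitrary, and the deployment name passed in is a placeholder for any deployment in the cluster.

```shell
#!/bin/sh
# Hypothetical helper: extract the "Failed requests" count from ab's report
# (reads the ab output on stdin).
failed_requests() {
  awk -F: '/^Failed requests/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical wrapper: run ab in the background, scale a deployment while
# the benchmark is running, then print the number of failed requests.
# Arguments: <target-url> <deployment-name> <replica-count>
ab_during_scale() {
  url=$1; deploy=$2; replicas=$3
  out=$(mktemp)
  ab -n 5000 -c 2 "$url" > "$out" 2>&1 &
  ab_pid=$!
  sleep 5                          # arbitrary warm-up before changing pods
  kubectl scale deployment "$deploy" --replicas="$replicas"
  wait "$ab_pid"
  failed_requests < "$out"
  rm -f "$out"
}

# Example (not run automatically; "test" is a placeholder deployment name):
#   ab_during_scale http://10.112.0.24/ test 3
```

When ab aborts on a socket error it exits without printing its report, so the helper prints nothing in that case and the raw ab output has to be inspected instead.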

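Other than some bizarre iptables conflict, I really do not see how deleting pod A can affect the network connections of pod B. Since the errors are brief and go away within seconds, it looks more like a short network outage.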

Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/

Errors when using HTTPS:

SSL handshake failed (5).
SSL read failed (5) - closing connection

Errors when using HTTP:

apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)

Example ab output. I pressed Ctrl-C after encountering the first error:

$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C

Server Software:        nginx
Server Hostname:        10.112.0.24
Server Port:            443
SSL/TLS Protocol:       TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256

Document Path:          /
Document Length:        2575 bytes

Concurrency Level:      2
Time taken for tests:   21.670 seconds
Complete requests:      1824
Failed requests:        2
   (Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred:      5142683 bytes
HTML transferred:       4694225 bytes
Requests per second:    84.17 [#/sec] (mean)
Time per request:       23.761 [ms] (mean)
Time per request:       11.881 [ms] (mean, across all concurrent requests)
Transfer rate:          231.75 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        5   15   9.8     12      82
Processing:     1    9   9.0      6     130
Waiting:        0    8   8.9      6     129
Total:          7   23  14.4     19     142

Percentage of the requests served within a certain time (ms)
  50%     19
  66%     24
  75%     28
  80%     30
  90%     40
  95%     54
  98%     66
  99%     79
 100%    142 (longest request)

Current sysctl settings that may be relevant:

net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050    65500

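I did not see any conntrack "table full" errors. As best I can tell, there is no packet loss. We recently upgraded from 1.14 and did not notice the issue before, but I cannot say for certain that it was not there. I believe we will be forced to migrate away from romana soon, since it no longer seems to be maintained and we are running into startup problems with it as we upgrade to kube 1.16.x.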
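As a quick sanity check on the conntrack claim, the usage implied by the sysctl values listed above can be worked out directly. This is a minimal sketch; the helper name conntrack_usage_pct is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: conntrack table usage as a percentage of the maximum.
# Arguments: <current-count> <max-entries>
conntrack_usage_pct() {
  awk -v c="$1" -v m="$2" 'BEGIN { printf "%.2f\n", c * 100 / m }'
}

# Values from the sysctl listing: nf_conntrack_count=1280, nf_conntrack_max=131072
conntrack_usage_pct 1280 131072    # prints 0.98

# On a live node the same figures can be read from /proc, and the kernel log
# can be searched for the "table full" message (commented out here because
# they need access to the node):
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
#   dmesg | grep 'nf_conntrack: table full'
```

At under 1% of the configured maximum, the table is nowhere near full, which is consistent with the absence of "table full" messages.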

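I have searched the internet all day for similar problems. The closest match to ours is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 but I have no idea how to apply the iptables masquerade --random-fully option given that we use romana, and I have read ( https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153 ) that random-fully is the default on Linux kernel 5, which we are running. Any ideas?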

Kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

====== Update Nov 5, 2019 ======

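It was suggested that I test an alternate CNI. I chose calico since we used it in an older Debian-based kube cluster. I rebuilt a VM from our most basic CentOS 7 template (vSphere), so it carries some baggage from our customizations. I cannot list everything we customized in the template, but the most notable change is the kernel 5 upgrade: yum --enablerepo=elrepo-kernel -y install kernel-ml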

After starting up the VM, these are the minimal steps to install Kubernetes and run the test:

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

yum -y install docker-ce-3:18.09.6-3.el7.x86_64

systemctl start docker

cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF

# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables

yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0

systemctl enable --now kubelet

kubeadm init --pod-network-cidr=192.168.0.0/16

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

kubectl taint nodes --all node-role.kubernetes.io/master-

kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml

cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  replicas: 1
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: nginx
        image: nginxdemos/hello
        ports:
        - containerPort: 80
EOF

# wait for control plane to become healthy

kubectl apply -f /tmp/test-deploy.yml

Now that the setup is ready, this is the ab test:

$ docker run --rm jordi/ab -n 100 -c 1  http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed

After this error, ab gives up. If I reduce the number of requests to avoid the timeout, this is what you see:

$ docker run --rm jordi/ab -n 10 -c 1  http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.4.4 (be patient).....done


Server Software:        nginx/1.13.8
Server Hostname:        192.168.4.4
Server Port:            80

Document Path:          /
Document Length:        7227 bytes

Concurrency Level:      1
Time taken for tests:   0.029 seconds
Complete requests:      10
Failed requests:        0
Total transferred:      74140 bytes
HTML transferred:       72270 bytes
Requests per second:    342.18 [#/sec] (mean)
Time per request:       2.922 [ms] (mean)
Time per request:       2.922 [ms] (mean, across all concurrent requests)
Transfer rate:          2477.50 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.8      1       3
Processing:     1    2   1.2      1       4
Waiting:        0    1   1.3      0       4
Total:          1    3   1.4      3       5

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      3
  75%      4
  80%      5
  90%      5
  95%      5
  98%      5
  99%      5
 100%      5 (longest request)

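This issue is technically different from the one I originally reported, but it is a different CNI and there are still network problems. It does share the timeout error with the kube/romana cluster when I run the same test there, that is, with ab running on the same node as the pod. Both hit the same timeout error, but with romana I could complete a few thousand requests before the timeout, whereas calico hits the timeout before reaching a dozen requests.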

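Other variants or notes:
- Setting net.netfilter.nf_conntrack_tcp_be_liberal to 0 or 1 does not seem to make a difference.
- Higher -n values sometimes work, but it is largely random.
- Running the ab test with low -n values several times in a row can sometimes trigger the timeout.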

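At this point I am fairly sure something is wrong with our CentOS installation, but I cannot even guess what it might be. Are there any other limits, sysctls or other configuration settings that could cause this?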

====== Update Nov 6, 2019 ======

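I noticed that we had an older kernel installed, so I upgraded my kube/calico test VM to the same newer kernel, 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots, I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timeout errors.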

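With the timeout error gone, I was able to repeat the original test, in which I load test pod A and kill pod B, on my vSphere VMs. In the romana environments the problem still exists, but only when the load test runs on a different host from the one where pod A is located. If I run the test on the same host, there are no errors at all. With calico instead of romana, there are no load test errors from either host, so the problem is gone. There may still be some setting to tweak that would help romana, but I think this is "strike 3" for romana, so I will start transitioning a full environment to calico and do some acceptance testing there to make sure there are no hidden gotchas.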

score 0 · Accepted Answer

You mentioned that no errors are reported if there are no pod changes while the ab test is running. That means the errors occur when you add or delete a pod.

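This is normal behaviour when a pod is deleted: it takes time for the iptables rule changes to propagate. A container may already have been removed while the iptables rules have not yet been updated, so packets are still forwarded to a container that no longer exists, and this causes errors (it is a kind of race condition).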

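The first thing you can do is to always define a readiness probe, because it ensures that traffic is not forwarded to a container until it is ready to handle requests.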
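As an illustration, a readiness probe could be added to the test Deployment from the question roughly as follows. This is a sketch, not a tuned configuration: the output file name, the probe path / and the timing values are assumptions chosen for the example.

```shell
#!/bin/sh
# Write a variant of the question's test Deployment with a readiness probe.
# File name, probe path and timing values are illustrative assumptions.
cat <<'EOF' > /tmp/test-deploy-ready.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  replicas: 1
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: nginx
        image: nginxdemos/hello
        ports:
        - containerPort: 80
        readinessProbe:          # pod receives traffic only after this succeeds
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 2
          periodSeconds: 5
EOF

# To apply on a cluster (needs kubectl access):
#   kubectl apply -f /tmp/test-deploy-ready.yml
```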

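The second thing is to handle container deletion properly. This is a somewhat harder task because it can be handled at several levels, but the simplest thing you can do is add a PreStop hook to your container, like this: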

lifecycle:
 preStop:
  exec:
   command:
   - sh
   - -c
   - "sleep 5"

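The PreStop hook is executed at the moment the pod deletion is requested. From that moment, k8s starts changing the iptables rules and should stop forwarding new traffic to the container that is about to be deleted. The sleep gives k8s some time to propagate the iptables changes across the cluster without interrupting connections that already exist. After the PreStop handler exits, the container receives the SIGTERM signal.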
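One way to try this on an existing Deployment is a strategic merge patch. The sketch below assumes the Deployment is named test and its container is named nginx, as in the question's test manifest; the patch file name is arbitrary. Time spent in the PreStop hook counts against the pod's termination grace period (30 seconds by default), so the sleep has to stay well below it.

```shell
#!/bin/sh
# Write a patch that adds the PreStop sleep to the container named "nginx".
# Deployment/container names follow the question's test manifest; the file
# name is an assumption for this example.
cat <<'EOF' > /tmp/prestop-patch.yml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # default value, stated explicitly
      containers:
      - name: nginx
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 5"]
EOF

# To apply on a cluster (needs kubectl access):
#   kubectl patch deployment test --patch "$(cat /tmp/prestop-patch.yml)"
```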

My suggestion is to apply both mechanisms together and check whether it helps.

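You also mentioned that bypassing ingress causes more errors. I would assume this is because ingress implements a retry mechanism. If it is unable to open a connection to a container, it tries several times and will hopefully reach a container that can handle the request.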
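To see the effect of retries when talking to a pod directly, the same idea can be approximated on the client side. The retry function below is a hypothetical helper written for this example, not part of ab, curl or Kubernetes, and it only masks the brief failures rather than fixing their cause.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to <attempts> times, pausing <delay>
# seconds after each failure; returns 0 on the first success, 1 otherwise.
# Arguments: <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Example (not run automatically), using the pod IP from the question:
#   retry 3 1 curl -fsS -o /dev/null http://10.112.0.24/
```

If requests that fail during a pod change succeed on the second or third attempt, that is consistent with the short race described above.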
