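I have a 3-node Kubernetes cluster using the Weave CNI plugin:

  • 1 master node (virtual machine)
  • 2 bare-metal worker nodes (4-core Xeon with hyper-threading, 8 logical cores)

The problem is that top shows kubelet using 60-100% CPU on the first worker. In journalctl -u kubelet I see a lot of messages (hundreds per minute):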

May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075243    3843 docker_sandbox.go:205] Failed to stop sandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640": Error response from daemon: {"message":"No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075360    3843 remote_runtime.go:109] StopPodSandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-p6kwb_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.075380    3843 kuberuntime_gc.go:138] Failed to stop sandbox "011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-p6kwb_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 011cf10cf46dbc6bf2e11d1cb562af478eee21eba0c40521bf7af51ee5399640
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076549    3843 docker_sandbox.go:205] Failed to stop sandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf": Error response from daemon: {"message":"No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076654    3843 remote_runtime.go:109] StopPodSandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-6g8jq_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.076676    3843 kuberuntime_gc.go:138] Failed to stop sandbox "0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-6g8jq_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 0125de37634ef7f3aa852c999cfb5849750167b1e3d63293a085ceca416e4ebf
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.079585    3843 docker_sandbox.go:205] Failed to stop sandbox "014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772": Error response from daemon: {"message":"No such container: 014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772"}
May 19 09:57:38 kube-worker1 bash[3843]: E0519 09:57:38.079805    3843 remote_runtime.go:109] StopPodSandbox "014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "cron-task-2533948c46c1-r30cw_namespace" network: CNI failed to retrieve network namespace path: Error: No such container: 014135ede46ee45c176528da02782a38ded36bd10566f864c147ccb66a617772
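
To gauge how many distinct sandboxes are behind messages like the ones above, the log lines can be summarized per sandbox ID. The following is a minimal Python sketch, not part of the original setup: the function name count_stale_sandboxes is hypothetical, and the pattern assumes the 'Failed to stop sandbox "<64 hex characters>"' wording shown in the excerpt.

```python
import re
from collections import Counter

# Assumption: sandbox IDs are 64 lowercase hex characters, quoted right after
# the phrase "Failed to stop sandbox", as in the log excerpt above.
SANDBOX_RE = re.compile(r'Failed to stop sandbox "([0-9a-f]{64})"')


def count_stale_sandboxes(lines):
    """Count 'Failed to stop sandbox' messages per sandbox ID.

    Both the docker_sandbox.go and the kuberuntime_gc.go lines contain this
    phrase, so one failed cleanup attempt is typically counted twice.
    """
    counts = Counter()
    for line in lines:
        match = SANDBOX_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed the output of journalctl -u kubelet --no-pager, for example by passing sys.stdin as lines; the keys of the returned counter are the sandbox IDs kubelet keeps retrying.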

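This started after a buggy cron job task that failed during creation. I deleted all of its pods with --force, but kubelet still tries to remove them. I also restarted kubelet on that worker, with no effect. How can I tell kubelet to forget about them?

Version info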

Kubernetes v1.6.1
Docker version 1.12.0, build 8eab29e
Linux kube-worker1 4.4.0-72-generic #93-Ubuntu SMP

Container manifest (without metadata)

  job:
    apiVersion: batch/v1
    kind: Job
    spec:
      template:
        spec:
          containers:
          - name: cron-task
            image: docker.company.ru/image:v2.3.2
            command: ["rake", "db:refresh_views"]
            env:
            - name: RAILS_ENV
              value: namespace
            - name: CONFIG_PATH
              value: /config
            volumeMounts:
            - name: config
              mountPath: /config
          volumes:
          - name: config
            configMap:
              name: task-conf
          restartPolicy: Never

Also, I found no mention of this pod's name fragment (2533948c46c1) anywhere in the cluster's etcd.

3 Answers

I finally found the solution.
Kubelet stores information about all pods running on the node in

/var/lib/dockershim/sandbox

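So when I ran ls in that folder, I found files for all of the missing pods. I deleted those files, the log messages disappeared, and CPU usage returned to normal (even without restarting kubelet).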
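
The cleanup described above can be sketched as a small Python script. This is only an illustration under stated assumptions: it assumes each file in /var/lib/dockershim/sandbox is named after the sandbox ID that appears in the kubelet errors, and the function names find_stale_checkpoints and remove_checkpoints are hypothetical. It defaults to a dry run so nothing is deleted until the list has been reviewed.

```python
from pathlib import Path


def find_stale_checkpoints(sandbox_ids, checkpoint_dir="/var/lib/dockershim/sandbox"):
    """Return files in checkpoint_dir whose name equals one of sandbox_ids.

    Assumption: checkpoint files are named after the sandbox ID.
    """
    wanted = set(sandbox_ids)
    directory = Path(checkpoint_dir)
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.name in wanted
    )


def remove_checkpoints(paths, dry_run=True):
    """Delete the given files; with dry_run=True only print what would go."""
    for path in paths:
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()
```

A cautious way to use it is to pass the sandbox IDs taken from the "No such container" errors, inspect the dry-run output, and only then call remove_checkpoints(paths, dry_run=False) as root.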

answered 2017-05-25T06:46:01.940

I ran into the same problem and analyzed it. The cause turned out to be the kubelet PLEG mechanism, and deleting the files under '/var/lib/dockershim/sandbox' worked like magic.

answered 2017-06-17T14:04:27.073

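This appears to be related to the Kubernetes 1.6.x issue in which Pods with hostNetwork=true cannot be removed (and errors are generated) when CNI is in use. In any case, these messages are not critical, although they are certainly annoying when you are trying to track down a real problem. Try upgrading to the latest version of Kubernetes to mitigate these issues.

answered 2017-05-19T10:50:01.307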