
I can't get Ingress to work on GKE, owing to health check failures. I've tried all of the debugging steps I can think of, including:

  • Verified I'm not running low on any quotas
  • Verified that my service is accessible from within the cluster
  • Verified that my service works behind a k8s/GKE Load Balancer
  • Verified that healthz checks are passing in Stackdriver logs

... I'd love any advice about how to debug or fix. Details below!


I have set up a service with type LoadBalancer on GKE. Works great via external IP:

apiVersion: v1
kind: Service
metadata:
  name: echoserver
  namespace: es
spec:
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
  selector:
    app: echoserver

Then I try setting up an Ingress on top of this same service:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: echoserver-ingress
  namespace: es
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "echoserver-global-ip"
spec:
  backend:
    serviceName: echoserver
    servicePort: 80
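
(For reference: extensions/v1beta1 was the current Ingress API when this was written; on clusters running Kubernetes 1.19+, the equivalent manifest would use networking.k8s.io/v1 instead. A sketch, assuming the same service and static-IP names:)

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver-ingress
  namespace: es
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.global-static-ip-name: "echoserver-global-ip"
spec:
  defaultBackend:
    service:
      name: echoserver
      port:
        number: 80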

The Ingress gets created, but it thinks the backend nodes are unhealthy:

$ kubectl --namespace es describe ingress echoserver-ingress | grep backends
  backends:     {"k8s-be-31102--<snipped>":"UNHEALTHY"}

Inspecting the state of the Ingress backend in the GKE web console, I see the same thing:

0 of 3 healthy
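
(You can also inspect the GCE-side view of this backend directly with gcloud. A sketch, assuming the backend-service name matches the one reported by kubectl above; GKE Ingress backends are global resources:)

$ gcloud compute backend-services list
$ gcloud compute backend-services get-health k8s-be-31102--<snipped> --global

get-health reports per-instance health as the load balancer sees it, which can disagree with what you observe from inside the cluster.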

The health check details appear as expected:

[screenshot: health check details]

... and from within a pod in my cluster I can call the service successfully:

# curl  -vvv echoserver  2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK

# curl  -vvv echoserver/healthz  2>&1 | grep "< HTTP"
< HTTP/1.0 200 OK

And I can address the service by NodePort:

# curl  -vvv 10.0.1.1:31102  2>&1 | grep "< HTTP" 
< HTTP/1.0 200 OK

(As expected, since the LoadBalancer service from step 1 is already serving the site just fine.)

I also see healthz checks passing in Stackdriver logs:

[screenshot: healthz checks passing in Stackdriver logs]

Regarding quotas, I check and see I'm only using 3 of 30 backend services:

$ gcloud compute project-info describe | grep -A 1 -B 1  BACKEND_SERVICES
- limit: 30.0
  metric: BACKEND_SERVICES
  usage: 3.0

3 Answers


I had a similar issue a few weeks ago. What fixed it for me was adding a NodePort to the service spec so that the Google Cloud load balancer can probe that NodePort. The config that worked for me:

apiVersion: v1
kind: Service
metadata: 
  name: some-service
spec: 
  selector: 
    name: some-app
  type: NodePort
  ports: 
    - port: 80
      targetPort: 8080
      nodePort: 32000
      protocol: TCP

It may take some time for the ingress to pick this up. Recreating the ingress can speed things up.
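
(Recreating the ingress can be done with kubectl; the manifest filename here is an assumption:)

$ kubectl --namespace es delete ingress echoserver-ingress
$ kubectl --namespace es apply -f echoserver-ingress.yaml

Deleting the Ingress tears down the GCE forwarding rule, backend services, and health checks, and applying it again recreates them from scratch.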

Answered 2017-09-20T20:24:16.150

You have configured the timeout value to be 1 second. Perhaps increasing it to 5 seconds will solve the issue.
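
(A sketch of how to adjust this with gcloud; HEALTH_CHECK_NAME is a placeholder for the auto-generated check, which you can find with the list command. Note that the GCE ingress controller created this check, so it may revert manual edits:)

$ gcloud compute health-checks list
$ gcloud compute health-checks update http HEALTH_CHECK_NAME --timeout=5s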

Answered 2017-09-20T04:37:12.860

I ran into this problem and eventually came across https://stackoverflow.com/a/50645953/9276, which got me looking at my firewall settings. Sure enough, the last few NodePort services I had added were not enabled in the firewall rule, so the health checks for the ingresses pointing at them all failed. Manually adding the new host ports to the firewall rule fixed it for me.

Unlike the linked answer, though, I was not using an invalid certificate. I suspect there are other errors or odd states that can trigger this behavior, but I have not found out why the rule stopped being managed automatically.
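
(Google's load-balancer health-check probes originate from the documented ranges 130.211.0.0/22 and 35.191.0.0/16. A sketch of allowing them through to the NodePort range; the rule name and node tag are placeholders for your own values:)

$ gcloud compute firewall-rules create allow-gke-lb-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp:30000-32767 \
    --target-tags=<your-node-tag>

If a rule already exists but is missing ports, gcloud compute firewall-rules update <rule-name> --allow=tcp:30000-32767 covers the whole NodePort range instead of individual ports.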

Possibly unrelated: I don't have this problem in our QA environment, only in production, so there may be GCP project-level settings at play.

Answered 2018-08-24T17:43:58.110