I have a Next.js app with two simple endpoints, one for readiness and one for liveness, both implemented as follows:

return res.status(200).send('OK');

I created the endpoints as per the API routes docs. I also set a /stats basePath as per the docs, so the probe endpoints are at /stats/api/readiness and /stats/api/liveness.

When I build and run the app in a Docker container locally, the probe endpoints are accessible and return 200 OK.

When I deploy the app to my k8s cluster, though, the probes fail. There's plenty of initialDelaySeconds time, so that's not the cause.

I connected to the pod's service through port-forward. Right after the pod starts, before the probes fail, I can hit the endpoint and it returns 200 OK. Shortly afterwards it starts failing as usual.

I also tried accessing the failing pod from a healthy pod:

k exec -t [healthy pod name] -- curl -l 10.133.2.35:8080/stats/api/readiness

The result is the same: at first, while the pod hasn't failed yet, the curl command returns 200 OK. Shortly afterwards it starts failing.

The error on the probes that I get is:

Readiness probe failed: Get http://10.133.2.35:8080/stats/api/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

As an experiment, I pointed the probes at a random, non-existent endpoint and got the same error. That makes me think the probes fail because they cannot reach the proper endpoints.

But then again, the endpoints are accessible for a while before the probes start failing, so I have no idea why this is happening.

Here is my k8s deployment config for the probes:

      livenessProbe:
        httpGet:
          path: /stats/api/liveness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 5
      readinessProbe:
        httpGet:
          path: /stats/api/readiness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 3
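
For reference, here is a rough back-of-the-envelope sketch of how quickly these settings act once the probes begin to time out. It is a simplified model and an assumption, not kubelet's exact scheduling: it supposes one attempt per periodSeconds and that the final failing attempt takes the full timeoutSeconds. The function name is hypothetical.

```javascript
// Rough model (an assumption, not kubelet's exact scheduling): probes run every
// periodSeconds, and each failing attempt takes up to timeoutSeconds to give up.
function secondsUntilFailureAction({ periodSeconds, timeoutSeconds, failureThreshold }) {
  // The first (failureThreshold - 1) failures are spaced one period apart;
  // the last one still needs up to timeoutSeconds before it counts as failed.
  return (failureThreshold - 1) * periodSeconds + timeoutSeconds;
}

// Values copied from the probe config above.
const liveness = { periodSeconds: 3, timeoutSeconds: 3, failureThreshold: 5 };
const readiness = { periodSeconds: 3, timeoutSeconds: 3, failureThreshold: 3 };

console.log('liveness:', secondsUntilFailureAction(liveness), 's');   // liveness: 15 s
console.log('readiness:', secondsUntilFailureAction(readiness), 's'); // readiness: 9 s
```

Under this model the pod would be marked unready roughly 9 seconds after the first slow response, and the container restarted after roughly 15 seconds.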

Update

I used curl -v as requested in the comments. The result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 10.133.0.12 (10.133.0.12) port 8080 (#0)
> GET /stats/api/healthz HTTP/1.1
> Host: 10.133.0.12:8080
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< ETag: "2-nOO9QiTIwXgNtWtBJezz8kv3SLc"
< Content-Length: 2
< Date: Wed, 16 Jun 2021 18:42:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
{ [2 bytes data]
100     2  100     2    0     0    666      0 --:--:-- --:--:-- --:--:--   666
* Connection #0 to host 10.133.0.12 left intact
OK%

Then, of course, once it starts failing, the result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connect to 10.133.0.12 port 8080 failed: Connection refused
* Failed to connect to 10.133.0.12 port 8080: Connection refused
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (7) Failed to connect to 10.133.0.12 port 8080: Connection refused
command terminated with exit code 7
1 Answer


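The error tells you: Client.Timeout exceeded while awaiting headers. This means the TCP connection was established (neither refused nor timed out).

Your liveness/readiness probe timeout is too low. Your application doesn't have enough time to respond.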
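
To make the difference between "timed out waiting for a response" and "connection refused" concrete, here is a minimal sketch of a probe-like check in Node. It assumes Node 18 or later, where fetch and AbortSignal.timeout are built in, a timeout rejects with an error named TimeoutError, and a refused connection shows up as err.cause.code. The names classifyProbeError and probe are hypothetical, and this is not how kubelet itself is implemented.

```javascript
// Map a failed fetch() to the kind of failure a probe would report.
// Assumes Node 18+ semantics: AbortSignal.timeout() rejects with a
// "TimeoutError", and a refused connection surfaces as err.cause.code.
function classifyProbeError(err) {
  if (err && (err.name === 'TimeoutError' || err.name === 'AbortError')) {
    return 'timeout'; // no response headers arrived within the deadline
  }
  if (err && err.cause && err.cause.code === 'ECONNREFUSED') {
    return 'refused'; // nothing is listening on the port
  }
  return 'other';
}

// Hypothetical helper: one HTTP GET with a deadline, similar in spirit
// to an httpGet probe with timeoutSeconds.
async function probe(url, timeoutSeconds) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutSeconds * 1000) });
    return `http ${res.status}`;
  } catch (err) {
    return classifyProbeError(err);
  }
}
```

For example, `probe('http://10.133.2.35:8080/stats/api/readiness', 3).then(console.log)` would print `timeout` for a slow application and `refused` once nothing is listening on the port any more.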

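This could be due to CPU or memory allocations being smaller than on your laptop, to higher concurrency, or perhaps to a LimitRange that sets some defaults where you did not set any.

Check with: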

time kubectl exec -t [healthy pod name] -- curl -l 127.0.0.1:8080/stats/api/readiness

If you can't allocate more CPU, double that time, round it up, and fix your probes:

  livenessProbe:
    ...
    timeoutSeconds: 10

  readinessProbe:
    ...
    timeoutSeconds: 10
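
The "double it and round up" rule can be sketched as a small helper. The function name and the sample measurements are hypothetical; the only rule taken from the text above is doubling the measured time and rounding up, and you may well prefer to round further up to a convenient value such as 10.

```javascript
// Turn a measured response time (seconds) into a probe timeoutSeconds value:
// double it for headroom, then round up to a whole second.
function suggestTimeoutSeconds(measuredSeconds) {
  const doubled = measuredSeconds * 2;
  // Kubernetes requires timeoutSeconds to be at least 1.
  return Math.max(1, Math.ceil(doubled));
}

console.log(suggestTimeoutSeconds(4.3)); // 9
console.log(suggestTimeoutSeconds(0.2)); // 1
```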

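Alternatively, though probably less in the spirit of a health check, you could replace those httpGet checks with tcpSocket ones. They would be faster, but may miss actual issues.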

answered 2021-06-16T19:10:20.877