
I am facing an issue where a TCP socket with keepalive enabled is being reset for no apparent reason.

The keepalive settings are defined as follows:

- tcp_keepalive_time = 20 sec
- tcp_keepalive_probes = 3
- tcp_keepalive_intvl = 3 sec
- tcp_user_timeout = 20 sec
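
For reference, here is a minimal sketch of how these values could be applied to a single socket on Linux. The question does not show how the settings were actually configured, so the helper name `apply_keepalive_settings` is hypothetical, and setting the values per socket (rather than via sysctl) is an assumption. Note that `TCP_USER_TIMEOUT` takes milliseconds, while the keepalive options take seconds.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the original post): applies the keepalive
 * values listed above to one TCP socket.
 * Returns 0 on success, -1 on the first setsockopt() failure (errno is set). */
static int apply_keepalive_settings(int fd)
{
    int on = 1;
    int idle_sec = 20;                     /* tcp_keepalive_time   */
    int probes = 3;                        /* tcp_keepalive_probes */
    int intvl_sec = 3;                     /* tcp_keepalive_intvl  */
    unsigned int user_timeout_ms = 20000;  /* tcp_user_timeout, in milliseconds */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec, sizeof(intvl_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &user_timeout_ms, sizeof(user_timeout_ms)) < 0)
        return -1;
    return 0;
}
```

Calling `apply_keepalive_settings(fd)` on a freshly created `SOCK_STREAM` socket and reading the options back with `getsockopt()` is a quick way to confirm which values are really in effect on each side.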

The RST packet is sent roughly 3 seconds after the last keepalive probe/ACK exchange (a delay equal to tcp_keepalive_intvl), as shown below:

193986  2021-11-08 09:25:42.082749        10.5.40.37            10.5.40.38            TCP        154         64           57238 → 55493 [PSH, ACK] Seq=16767 Ack=8113 Win=3650 Len=88 TSval=3370932349 TSecr=1741924624
193987  2021-11-08 09:25:42.083133        10.5.40.38            10.5.40.37            TCP        66           58           55493 → 57238 [ACK] Seq=8113 Ack=16855 Win=5068 Len=0 TSval=1741925586 TSecr=3370932349
193988  2021-11-08 09:25:42.083191        10.5.40.38            10.5.40.37            TCP        170         58           55493 → 57238 [PSH, ACK] Seq=8113 Ack=16855 Win=5068 Len=104 TSval=1741925586 TSecr=3370932349
193991  2021-11-08 09:25:42.125225        10.5.40.37            10.5.40.38            TCP        66           64           57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370932391 TSecr=1741925586
196998  2021-11-08 09:26:03.649256        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370953915 TSecr=1741925586
196999  2021-11-08 09:26:03.649841        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741947152 TSecr=3370932391
197024  2021-11-08 09:26:03.929412        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741947432 TSecr=3370932391
197025  2021-11-08 09:26:03.929454        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370954195 TSecr=1741947152
198851  2021-11-08 09:26:24.133227        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370974398 TSecr=1741947152
198853  2021-11-08 09:26:24.133615        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741967636 TSecr=3370954195
198872  2021-11-08 09:26:24.405168        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741967907 TSecr=3370954195
198873  2021-11-08 09:26:24.405229        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370974670 TSecr=1741967636
201433  2021-11-08 09:26:44.609231        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370994874 TSecr=1741967636
201434  2021-11-08 09:26:44.609595        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741988111 TSecr=3370974670
201468  2021-11-08 09:26:44.888908        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741988391 TSecr=3370974670
201469  2021-11-08 09:26:44.888957        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370995153 TSecr=1741988111
204434  2021-11-08 09:27:05.089228        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3371015353 TSecr=1741988111
204435  2021-11-08 09:27:05.089619        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1742008591 TSecr=3370995153
204460  2021-11-08 09:27:05.364662        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1742008866 TSecr=3370995153
204461  2021-11-08 09:27:05.364703        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371015629 TSecr=1742008591
206324  2021-11-08 09:27:25.573262        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3371035837 TSecr=1742008591
206325  2021-11-08 09:27:25.574628        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1742029075 TSecr=3371015629
206358  2021-11-08 09:27:25.844446        10.5.40.38            10.5.40.37            TCP        66           58           [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1742029346 TSecr=3371015629
206359  2021-11-08 09:27:25.844481        10.5.40.37            10.5.40.38            TCP        66           64           [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371036108 TSecr=1742008591
206568  2021-11-08 09:27:28.642162        10.5.40.37            10.5.40.38            TCP        66           64           57238 → 55493 [RST, ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371038906 TSecr=1742008591
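
To make the timing in the capture concrete, the gaps between the RST and the preceding keepalive packets can be computed from the timestamps. This is only an arithmetic sketch: the helper `capture_time_us` is hypothetical and assumes the `HH:MM:SS.uuuuuu` format with exactly six fractional digits, as printed above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: converts a capture timestamp "HH:MM:SS.uuuuuu"
 * (exactly six fractional digits) into microseconds since midnight.
 * Returns -1 if the string does not match that shape. */
static int64_t capture_time_us(const char *s)
{
    int h, m, sec, us;

    if (sscanf(s, "%d:%d:%d.%6d", &h, &m, &sec, &us) != 4)
        return -1;
    return (((int64_t)h * 60 + m) * 60 + sec) * 1000000 + us;
}
```

With the timestamps from the capture, `capture_time_us("09:27:28.642162") - capture_time_us("09:27:25.573262")` gives 3068900 µs, so the RST (packet 206568) follows the last keepalive probe sent by 10.5.40.37 (packet 206324) by about 3.07 s. Measured from the last keepalive ACK sent by 10.5.40.37 (packet 206359, 09:27:25.844481), the gap is 2797681 µs, or about 2.80 s.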

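Having inspected the relevant code (the tcp_keepalive_timer() function), I cannot find a path that leads to this behavior.

The only case in which a packet is expected to be sent after a time equal to tcp_keepalive_intvl is the retransmission of a keepalive probe for which no ACK was received, which does not apply here.

On the other hand, according to that code, an RST packet is expected if the following condition is met: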

if (elapsed >= keepalive_time_when(tp)) { /* check if time since last data >= 20 sec */
    /* If the TCP_USER_TIMEOUT option is enabled, use that
     * to determine when to timeout instead.
     */
    if ((icsk->icsk_user_timeout != 0 &&
         elapsed >= icsk->icsk_user_timeout &&
         icsk->icsk_probes_out > 0) ||
        (icsk->icsk_user_timeout == 0 &&
         icsk->icsk_probes_out >= keepalive_probes(tp))) {
        tcp_send_active_reset(sk, GFP_ATOMIC);
        tcp_write_err(sk);
        goto out;
    }
    /* ... remainder of the branch omitted ... */
}

This does not appear to hold in this case either.
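
The quoted condition can be restated as a standalone predicate to make it easier to plug in concrete values. This is a userspace sketch, not kernel code: the function name `keepalive_would_reset` and its parameters are hypothetical, and all times are expressed in milliseconds for readability, whereas the kernel works in its own time units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace restatement (assumption: all times in milliseconds) of the
 * reset condition quoted from tcp_keepalive_timer().
 * Returns true when that condition would lead to an active reset. */
static bool keepalive_would_reset(uint32_t elapsed_ms,
                                  uint32_t keepalive_time_ms,
                                  uint32_t user_timeout_ms,
                                  unsigned int probes_out,
                                  unsigned int max_probes)
{
    /* Outer check: idle time has not yet reached the keepalive time. */
    if (elapsed_ms < keepalive_time_ms)
        return false;

    /* With TCP_USER_TIMEOUT set, the probe count limit is ignored:
     * one outstanding probe is enough once the timeout has elapsed. */
    if (user_timeout_ms != 0)
        return elapsed_ms >= user_timeout_ms && probes_out > 0;

    /* Without TCP_USER_TIMEOUT, the probe count limit decides. */
    return probes_out >= max_probes;
}
```

With the settings above (20000 ms keepalive time, 20000 ms user timeout, 3 probes), `keepalive_would_reset(20000, 20000, 20000, 0, 3)` is false, while `keepalive_would_reset(20000, 20000, 20000, 1, 3)` is true. In other words, under this restatement the condition depends entirely on whether `icsk_probes_out` is still greater than zero when the timer runs.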

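For completeness, the two nodes run on different, geographically separated ESXi hosts.

Any ideas about what might cause this behavior would be greatly appreciated.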
