I am facing a problem where a TCP socket with keepalive enabled is reset for no apparent reason.
The keepalive settings are as follows:
- tcp_keepalive_time = 20 sec
- tcp_keepalive_probes = 3
- tcp_keepalive_intvl = 3 sec
- tcp_user_timeout = 20 sec
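
For reference, here is a minimal sketch of how values like these could be applied to a single socket on Linux. It assumes the settings are made per socket with setsockopt() rather than through sysctl; the post does not say which. The helper name apply_keepalive_settings is hypothetical. Note that TCP_USER_TIMEOUT takes milliseconds, while the three keepalive options take seconds and a probe count.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: applies the values listed above to one TCP socket.
 * Returns 0 on success, -1 on the first failing setsockopt() (errno is set). */
int apply_keepalive_settings(int fd)
{
    int on = 1;
    int idle_sec = 20;                    /* tcp_keepalive_time   */
    int intvl_sec = 3;                    /* tcp_keepalive_intvl  */
    int probes = 3;                       /* tcp_keepalive_probes */
    unsigned int user_timeout_ms = 20000; /* tcp_user_timeout, in milliseconds */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec, sizeof(intvl_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &user_timeout_ms, sizeof(user_timeout_ms)) < 0)
        return -1;
    return 0;
}
```

The helper would be called once on the connected (or listening) socket descriptor; reading the options back with getsockopt() is a quick way to confirm what is actually in effect.
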
The RST packet is transmitted about 3 seconds (a delay equal to tcp_keepalive_intvl) after the last keepalive ACK, as shown below:
193986 2021-11-08 09:25:42.082749 10.5.40.37 10.5.40.38 TCP 154 64 57238 → 55493 [PSH, ACK] Seq=16767 Ack=8113 Win=3650 Len=88 TSval=3370932349 TSecr=1741924624
193987 2021-11-08 09:25:42.083133 10.5.40.38 10.5.40.37 TCP 66 58 55493 → 57238 [ACK] Seq=8113 Ack=16855 Win=5068 Len=0 TSval=1741925586 TSecr=3370932349
193988 2021-11-08 09:25:42.083191 10.5.40.38 10.5.40.37 TCP 170 58 55493 → 57238 [PSH, ACK] Seq=8113 Ack=16855 Win=5068 Len=104 TSval=1741925586 TSecr=3370932349
193991 2021-11-08 09:25:42.125225 10.5.40.37 10.5.40.38 TCP 66 64 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370932391 TSecr=1741925586
196998 2021-11-08 09:26:03.649256 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370953915 TSecr=1741925586
196999 2021-11-08 09:26:03.649841 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741947152 TSecr=3370932391
197024 2021-11-08 09:26:03.929412 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741947432 TSecr=3370932391
197025 2021-11-08 09:26:03.929454 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370954195 TSecr=1741947152
198851 2021-11-08 09:26:24.133227 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370974398 TSecr=1741947152
198853 2021-11-08 09:26:24.133615 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741967636 TSecr=3370954195
198872 2021-11-08 09:26:24.405168 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741967907 TSecr=3370954195
198873 2021-11-08 09:26:24.405229 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370974670 TSecr=1741967636
201433 2021-11-08 09:26:44.609231 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3370994874 TSecr=1741967636
201434 2021-11-08 09:26:44.609595 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1741988111 TSecr=3370974670
201468 2021-11-08 09:26:44.888908 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1741988391 TSecr=3370974670
201469 2021-11-08 09:26:44.888957 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3370995153 TSecr=1741988111
204434 2021-11-08 09:27:05.089228 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3371015353 TSecr=1741988111
204435 2021-11-08 09:27:05.089619 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1742008591 TSecr=3370995153
204460 2021-11-08 09:27:05.364662 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1742008866 TSecr=3370995153
204461 2021-11-08 09:27:05.364703 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371015629 TSecr=1742008591
206324 2021-11-08 09:27:25.573262 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive] 57238 → 55493 [ACK] Seq=16854 Ack=8217 Win=3650 Len=0 TSval=3371035837 TSecr=1742008591
206325 2021-11-08 09:27:25.574628 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive ACK] 55493 → 57238 [ACK] Seq=8217 Ack=16855 Win=5068 Len=0 TSval=1742029075 TSecr=3371015629
206358 2021-11-08 09:27:25.844446 10.5.40.38 10.5.40.37 TCP 66 58 [TCP Keep-Alive] 55493 → 57238 [ACK] Seq=8216 Ack=16855 Win=5068 Len=0 TSval=1742029346 TSecr=3371015629
206359 2021-11-08 09:27:25.844481 10.5.40.37 10.5.40.38 TCP 66 64 [TCP Keep-Alive ACK] 57238 → 55493 [ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371036108 TSecr=1742008591
206568 2021-11-08 09:27:28.642162 10.5.40.37 10.5.40.38 TCP 66 64 57238 → 55493 [RST, ACK] Seq=16855 Ack=8217 Win=3650 Len=0 TSval=3371038906 TSecr=1742008591
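
The gaps between the last packets in the capture can be worked out from the timestamps above. The sketch below is one way to do that; the function name capture_time_us is hypothetical, and it assumes timestamps in the HH:MM:SS.uuuuuu form used in this capture, always with six fractional digits.

```c
#include <stdio.h>

/* Hypothetical helper: converts "HH:MM:SS.uuuuuu" into microseconds since
 * midnight. Returns -1 if the string does not match that layout.
 * Assumes exactly six fractional digits, as in the capture above. */
long long capture_time_us(const char *s)
{
    int h, m, sec;
    long us;

    if (sscanf(s, "%d:%d:%d.%6ld", &h, &m, &sec, &us) != 4)
        return -1;
    return ((h * 60LL + m) * 60 + sec) * 1000000LL + us;
}
```

Subtracting capture_time_us("09:27:25.844481") (packet 206359, the last keepalive ACK) from capture_time_us("09:27:28.642162") (packet 206568, the RST) gives 2797681 microseconds, about 2.80 s. Measured from packet 206324 at 09:27:25.573262, the last keepalive probe sent by 10.5.40.37, the gap is 3068900 microseconds, about 3.07 s.
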
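Looking through the relevant code (the tcp_keepalive_timer() function), I cannot find a path that leads to this behavior.
The only case in which a packet is expected after a delay equal to tcp_keepalive_intvl is the retransmission of a keepalive probe that has not been ACKed, which does not apply here.
On the other hand, according to that code, an RST packet is expected if the following condition holds: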
if (elapsed >= keepalive_time_when(tp)) { // check if time since last data >= 20 sec
    /* If the TCP_USER_TIMEOUT option is enabled, use that
     * to determine when to timeout instead.
     */
    if ((icsk->icsk_user_timeout != 0 &&
         elapsed >= icsk->icsk_user_timeout &&
         icsk->icsk_probes_out > 0) ||
        (icsk->icsk_user_timeout == 0 &&
         icsk->icsk_probes_out >= keepalive_probes(tp))) {
        tcp_send_active_reset(sk, GFP_ATOMIC);
        tcp_write_err(sk);
        goto out;
    }
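That does not appear to be the case here either.
For completeness, the two nodes run on different, geographically separated ESXi hosts.
Any ideas about what could cause this behavior would be greatly appreciated.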
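
To make the quoted condition easier to check against concrete numbers, it can be restated as a small userspace function. This is only a sketch that mirrors the boolean logic of the excerpt, not kernel code: the name keepalive_should_reset is hypothetical, and it assumes all times are expressed in the same unit (milliseconds here), whereas the kernel works in jiffies.

```c
#include <stdbool.h>

/* Userspace model of the quoted branch in tcp_keepalive_timer().
 * All times are in milliseconds (an assumption made for this sketch).
 * Returns true when the excerpt would reach tcp_send_active_reset(). */
bool keepalive_should_reset(unsigned int elapsed_ms,
                            unsigned int keepalive_time_ms,
                            unsigned int user_timeout_ms,
                            unsigned int probes_out,
                            unsigned int max_probes)
{
    /* Outer check: nothing happens before keepalive_time has elapsed. */
    if (elapsed_ms < keepalive_time_ms)
        return false;

    /* With a user timeout set, the probe count limit is not consulted:
     * one outstanding probe is enough once the timeout has elapsed. */
    if (user_timeout_ms != 0)
        return elapsed_ms >= user_timeout_ms && probes_out > 0;

    /* Without a user timeout, the probe count limit decides. */
    return probes_out >= max_probes;
}
```

For example, keepalive_should_reset(23000, 20000, 20000, 1, 3) returns true, while the same call with probes_out set to 0 returns false. The inputs are illustrative values, not measurements taken from the hosts in question.
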