0

我们的后端服务器使用 Embedded Jetty 8.1.15 已有几年了。直到最近,当我们开始对大量并发用户进行负载测试场景时,它才出现任何问题。然后,即使有少量用户使用 JMeter(具有 1000 个并发池和 KeepAlive 的 HTTP 采样器),我们也成功地重现了该问题。客户端和服务器之间的通信是通过 TLS(在客户端有连接池)客户端(连接池) -> TLS -> 服务器我们看到的行为 - 在某个时间点,很多线程“卡住”(等待在不同方法的监视器上)具有以下堆栈跟踪 -

Thread "qtp438123546-99":
at java.security.SecureRandom.nextBytes(byte[ ]) (line: 457)
at sun.security.ssl.RandomCookie.<init>(java.security.SecureRandom) (line: 53)
at sun.security.ssl.ServerHandshaker.clientHello(sun.security.ssl.HandshakeMessage$ClientHello) (line: 522)
at sun.security.ssl.ServerHandshaker.processMessage(byte, int) (line: 213)
at sun.security.ssl.Handshaker.processLoop() (line: 925)
at sun.security.ssl.Handshaker$1.run() (line: 865)
at sun.security.ssl.Handshaker$1.run() (line: 862)
at java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext)
at sun.security.ssl.Handshaker$DelegatedTask.run() (line: 1302)
at org.eclipse.jetty.io.nio.SslConnection.process(org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 375)
at org.eclipse.jetty.io.nio.SslConnection.access$900(org.eclipse.jetty.io.nio.SslConnection, org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 48)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.fill(org.eclipse.jetty.io.Buffer) (line: 678)
at org.eclipse.jetty.http.HttpParser.fill() (line: 1044)
at org.eclipse.jetty.http.HttpParser.parseNext() (line: 280)
at org.eclipse.jetty.http.HttpParser.parseAvailable() (line: 235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle() (line: 82)
at org.eclipse.jetty.io.nio.SslConnection.handle() (line: 196)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle() (line: 696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run() (line: 53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) (line: 608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() (line: 543)
at java.lang.Thread.run() (line: 745)

或者有时在其他 Java 安全同步 API 方法上

Thread "qtp438123546-993":
at sun.security.ssl.SignatureAndHashAlgorithm.getSupportedAlgorithms(java.security.AlgorithmConstraints) (line: 155)
at sun.security.ssl.Handshaker.getLocalSupportedSignAlgs() (line: 422)
at sun.security.ssl.ServerHandshaker.clientHello(sun.security.ssl.HandshakeMessage$ClientHello) (line: 700)
at sun.security.ssl.ServerHandshaker.processMessage(byte, int) (line: 213)
at sun.security.ssl.Handshaker.processLoop() (line: 925)
at sun.security.ssl.Handshaker$1.run() (line: 865)
at sun.security.ssl.Handshaker$1.run() (line: 862)
at java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext)
at sun.security.ssl.Handshaker$DelegatedTask.run() (line: 1302)
at org.eclipse.jetty.io.nio.SslConnection.process(org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 375)
at org.eclipse.jetty.io.nio.SslConnection.access$900(org.eclipse.jetty.io.nio.SslConnection, org.eclipse.jetty.io.Buffer, org.eclipse.jetty.io.Buffer) (line: 48)
at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.fill(org.eclipse.jetty.io.Buffer) (line: 678)
at org.eclipse.jetty.http.HttpParser.fill() (line: 1044)
at org.eclipse.jetty.http.HttpParser.parseNext() (line: 280)
at org.eclipse.jetty.http.HttpParser.parseAvailable() (line: 235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle() (line: 82)
at org.eclipse.jetty.io.nio.SslConnection.handle() (line: 196)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle() (line: 696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run() (line: 53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) (line: 608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() (line: 543)
at java.lang.Thread.run() (line: 745)

当一切正常时,在问题出现之前,客户端和服务器之间存在一个具有持久连接的连接池(可以在 netstat 中看到),但是当问题出现时,有很多处于不同状态的连接,除了 ESTABLISHED:

在客户端:

tcp6       0      0 client-host:39962       server-host:443         FIN_WAIT2
tcp6       0      1 client-host:45266       server-host:443         SYN_SENT
tcp6       0      0 client-host:46234       server-host:443         FIN_WAIT2
tcp6       0      0 client-host:38892       server-host:443         FIN_WAIT2
tcp6       0      0 client-host:39160       server-host:443         FIN_WAIT2
tcp6       0      0 client-host:39188       server-host:443         FIN_WAIT2
tcp6       0      1 client-host:43496       server-host:443         SYN_SENT
tcp6       0      0 client-host:46122       server-host:443         FIN_WAIT2
tcp6       0      1 client-host:44938       server-host:443         SYN_SENT
tcp6       0      0 client-host:46446       server-host:443         ESTABLISHED

在服务器端:

tcp        0   2980 server-host:443         client-host-1:34964       LAST_ACK
tcp        0   2980 server-host:443         client-host-3:52430       LAST_ACK
tcp        0   2980 server-host:443         client-host-1:35922       LAST_ACK
tcp        0      0 server-host:443         client-host-1:38362       CLOSE_WAIT
tcp      236      0 server-host:443         client-host-3:58296       CLOSE_WAIT
tcp        0   2980 server-host:443         client-host-1:34980       LAST_ACK
tcp        0   2980 server-host:443         client-host-2:55748       LAST_ACK
tcp        0   2980 server-host:443         client-host-3:53376       LAST_ACK
tcp        0      0 server-host:443         client-host-1:40104       SYN_RECV
tcp        0      0 server-host:443         client-host-1:38718       CLOSE_WAIT
tcp        0   2980 server-host:443         client-host-2:54142       LAST_ACK
tcp        0   2980 server-host:443         client-host-1:50766       LAST_ACK
tcp        0      0 server-host:443         client-host-1:38604       CLOSE_WAIT
tcp      236      0 server-host:443         client-host-3:57604       CLOSE_WAIT
tcp        0   2980 server-host:443         client-host-2:55502       LAST_ACK
tcp        0   2980 server-host:443         client-host-2:58254       LAST_ACK
tcp        0   2980 server-host:443         client-host-1:38042       LAST_ACK
tcp        0      0 server-host:443         client-host-1:38222       CLOSE_WAIT
tcp        0   2980 server-host:443         client-host-3:47812       LAST_ACK
tcp        0   2980 server-host:443         client-host-1:60532       LAST_ACK
tcp        0   2980 server-host:443         client-host-2:54282       LAST_ACK
tcp        0      0 server-host:443         client-host-1:40978       SYN_RECV

几乎所有线程都没有响应,CPU 非常高,GC 一直在工作

在此处输入图像描述

在此处输入图像描述

在此处输入图像描述

我们还在 JVM 中设置了以下标志:

-Djava.security.egd=文件:/dev/./urandom

为了 SecureRandom 将是非阻塞的(而不是 /dev/random)

Java 版本“1.8.0_05”Java(TM) SE 运行时环境(构建 1.8.0_05-b13)Java HotSpot(TM) 64 位服务器 VM(构建 25.5-b02,混合模式)

内核:4.14.94-89.73.amzn2.x86_64(但问题也出现在内核为2.6.32-696.20.1.el6.x86_64的系统上)

限制:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 151551
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

当组件进入该状态时,它没有响应,日志没有被写入日志文件(顺便说一句,我们使用 log4j2)

当组件上的负载停止时,组件需要几分钟才能恢复并再次响应

有没有人在你的 Java 后端组件中有类似的行为?请评论/建议调查和/或解决方案的方向

4

1 回答 1

0

SecureRandom尝试从操作系统提供的随机源(例如/dev/random)读取随机字节,但是如果系统没有足够的可用熵,此操作可能会挂起。

一种解决方法是使用另一个不会阻塞的随机源(例如/dev/urandom)。这可以通过更新来配置$JAVA_HOME/jre/lib/security/java.security

securerandom.source=file:/dev/./urandom

另一种选择是安装Haveged可以加快从/dev/random.

此错误报告有更多详细信息:https ://bugs.java.com/bugdatabase/view_bug.do?bug_id=6521844

于 2019-05-06T21:18:40.890 回答