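I'm working on an enterprise Java application that has accumulated many tools and frameworks, such as Struts, JAX-RS, and Spring MVC. It ships a UI and REST endpoints bundled in a single .war file. The project is evolving: we are retiring the older tools and trying to standardize on Spring MVC/WebFlux only.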

The application searches across millions of XML/JSON records, and the search engine was recently switched from MarkLogic to Elasticsearch.

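What we've noticed is that in production, where usage is not high (up to 1.7k rpm across 2-4 application nodes), response times on some endpoints increase over time. Elasticsearch has room to grow and shows no sign of heavy load. As a result, we currently have to restart or replace the production instances once every week or two, when the average response time exceeds 3 seconds instead of the usual 200-300 ms.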

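I tried capturing CPU and heap flame graphs with async-profiler, but the load profile changes with every measurement because the application exposes many features, so I can't really compare how the graphs change over time.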

Can you suggest some strategies or approaches for finding the right place in the code?

1 Answer

Found the problem. It was related to the HTTP client's connection pool.

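We noticed that the number of active Tomcat threads grew over time together with the response time.

[Image: Tomcat thread pool usage over time]

The graph also shows the server restart on May 9.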
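If no such dashboard is available, the same signal can be sampled from inside the JVM. Below is a minimal sketch using only the standard library. It assumes the worker threads share a name prefix such as `http-nio-` (an assumption about the connector configuration; adjust it to your setup), and `WorkerThreadStats` is a hypothetical name used only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

class WorkerThreadStats {
    /** Counts threads whose name starts with the given prefix, per Thread.State. */
    static Map<Thread.State, Integer> countByState(Iterable<Thread> threads, String prefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : threads) {
            if (t.getName().startsWith(prefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "http-nio-" is an assumed worker thread name prefix, not taken from the post.
        Map<Thread.State, Integer> counts =
                countByState(Thread.getAllStackTraces().keySet(), "http-nio-");
        System.out.println(counts);
    }
}
```

Logging this periodically gives a trend that does not depend on which features happen to be in use at measurement time; a steadily growing WAITING count is a hint that workers are parked on something.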

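Before the server was restarted I managed to capture a heap dump, and after some digging I found an interesting repeating fragment in the thread dump it contained: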

Thread xxx
  at sun.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()V (AbstractQueuedSynchronizer.java:2039)
  at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:377)
  at org.apache.http.pool.AbstractConnPool.access$200(Lorg/apache/http/pool/AbstractConnPool;Ljava/lang/Object;Ljava/lang/Object;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/Future;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:67)
  at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/pool/PoolEntry; (AbstractConnPool.java:243)
  at org.apache.http.pool.AbstractConnPool$2.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (AbstractConnPool.java:191)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(Ljava/util/concurrent/Future;JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:282)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(JLjava/util/concurrent/TimeUnit;)Lorg/apache/http/HttpClientConnection; (PoolingHttpClientConnectionManager.java:269)
  at org.apache.http.impl.execchain.MainClientExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (MainClientExec.java:191)
  at org.apache.http.impl.execchain.ProtocolExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (ProtocolExec.java:185)
  at org.apache.http.impl.execchain.RetryExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RetryExec.java:89)
  at org.apache.http.impl.execchain.RedirectExec.execute(Lorg/apache/http/conn/routing/HttpRoute;Lorg/apache/http/client/methods/HttpRequestWrapper;Lorg/apache/http/client/protocol/HttpClientContext;Lorg/apache/http/client/methods/HttpExecutionAware;)Lorg/apache/http/client/methods/CloseableHttpResponse; (RedirectExec.java:111)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(Lorg/apache/http/HttpHost;Lorg/apache/http/HttpRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;Lorg/apache/http/protocol/HttpContext;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:83)
  at org.apache.http.impl.client.CloseableHttpClient.execute(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (CloseableHttpClient.java:108)
  at io.searchbox.client.http.JestHttpClient.executeRequest(Lorg/apache/http/client/methods/HttpUriRequest;)Lorg/apache/http/client/methods/CloseableHttpResponse; (JestHttpClient.java:136)
  at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;Lorg/apache/http/client/config/RequestConfig;)Lio/searchbox/client/JestResult; (JestHttpClient.java:70)
  at io.searchbox.client.http.JestHttpClient.execute(Lio/searchbox/action/Action;)Lio/searchbox/client/JestResult; (JestHttpClient.java:63)
...
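Spotting a repeating fragment by eye is tedious when there are hundreds of threads. As a rough sketch of the same idea, the snippet below groups live threads by their top stack frames and prints the most common groups first. It relies only on `Thread.getAllStackTraces()` from the standard library; `StackGrouper` and the depth of 5 frames are illustrative choices, not something from the original investigation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class StackGrouper {
    /** Joins the top `depth` frames of a stack into a single key string. */
    static String key(StackTraceElement[] stack, int depth) {
        return Arrays.stream(stack)
                .limit(depth)
                .map(f -> f.getClassName() + "." + f.getMethodName())
                .collect(Collectors.joining(" <- "));
    }

    /** Counts threads per stack key and returns the counts, most frequent first. */
    static Map<String, Long> group(Map<Thread, StackTraceElement[]> traces, int depth) {
        Map<String, Long> counts = traces.values().stream()
                .filter(s -> s.length > 0) // skip threads with no recorded frames
                .collect(Collectors.groupingBy(s -> key(s, depth), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        group(Thread.getAllStackTraces(), 5)
                .forEach((k, n) -> System.out.println(n + "x " + k));
    }
}
```

Many threads sharing a key that ends in a pool or lock method, as in the trace above, points at the contended resource.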

In our case we use the Jest library to talk to Elasticsearch. Internally it uses Apache HTTP Client and Apache HTTP Async Client.

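As the thread dump shows, this thread is parked in AbstractConnPool.getPoolEntryBlocking, waiting for a free connection in the HTTP client's connection pool. Many more threads had exactly the same stack.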

I also found that we had maxTotal (maximum total connections) set to 20 and defaultMaxPerRoute (maximum connections per route) set to 2.

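By default, the pool allows only 20 concurrent connections in total and two concurrent connections per unique route. The two-connection limit is due to the requirements of the HTTP specification. In practice, however, this is often too restrictive.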

See the connection pool description.
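To illustrate why such a small per-route limit makes request threads pile up, here is a toy model of a bounded pool built on `java.util.concurrent.Semaphore`. It is not the Apache HttpClient implementation; `BoundedPoolDemo` and its methods are hypothetical names, and the 10 ms timeout is arbitrary. It only shows that once every slot is leased, the next caller has to wait until someone releases one.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {
    private final Semaphore permits;

    BoundedPoolDemo(int maxPerRoute) {
        // One permit per allowed concurrent connection; fair ordering for waiters.
        this.permits = new Semaphore(maxPerRoute, true);
    }

    /** Tries to lease a slot, waiting at most timeoutMs; returns false on timeout or interrupt. */
    boolean lease(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Returns a slot to the pool. */
    void release() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPoolDemo pool = new BoundedPoolDemo(2); // same limit as defaultMaxPerRoute = 2
        System.out.println("first lease:  " + pool.lease(10));
        System.out.println("second lease: " + pool.lease(10));
        // Pool exhausted: a caller waiting without a timeout would stay parked here.
        System.out.println("third lease:  " + pool.lease(10));
        pool.release();
        System.out.println("after release: " + pool.lease(10));
    }
}
```

With only two slots per route and a single Elasticsearch route, every request thread beyond the second waits for a slot, which is consistent with the parked threads in the dump above.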

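So the fix I made was to raise these values to 50 and 40 respectively. I would still prefer these parameters to be unbounded and grow with usage, but for now I'm sticking with these values.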

Answered 2020-05-14T08:24:04.300