2

最近,我们升级了在我们的生产环境中运行的第三方Web 应用程序的Java (1.6.0_18->1.6.0_38)和版本。我们很快收到警报说服务器上的 CPU 每天有几次超过 50% 的峰值。经过进一步分析,我观察到峰值与并发标记扫描主要 gc 的同时发生,并且完成它们所需的总 CPU 时间大大增加,特别是在 CMS-concurrent-mark 和 CMS-concurrent - 扫描阶段:Tomcat (6.0.32->7.0.34)J2EE

前:

2013-03-08T14:36:49.861-0500: 553875.681: [GC [1 CMS-initial-mark: 4152134K(8303424K)] 4156673K(8380096K), 0.0067893 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2013-03-08T14:36:49.868-0500: 553875.688: [CMS-concurrent-mark-start]
2013-03-08T14:36:55.682-0500: 553881.503: [GC 553881.503: [ParNew: 72675K->4635K(76672K), 0.0322031 secs] 4224809K->4157567K(8380096K), 0.0327540 secs] [Times: user=0.12 sys=0.01, real=0.03 secs] 
2013-03-08T14:36:58.224-0500: 553884.045: [CMS-concurrent-mark: 8.320/8.356 secs] [Times: user=9.18 sys=0.02, real=8.36 secs] 
2013-03-08T14:36:58.224-0500: 553884.045: [CMS-concurrent-preclean-start]
2013-03-08T14:36:58.276-0500: 553884.097: [CMS-concurrent-preclean: 0.051/0.052 secs] [Times: user=0.06 sys=0.00, real=0.05 secs] 
2013-03-08T14:36:58.277-0500: 553884.097: [CMS-concurrent-abortable-preclean-start]
2013-03-08T14:37:01.458-0500: 553887.279: [GC 553887.279: [ParNew: 72795K->4887K(76672K), 0.0332472 secs] 4225727K->4158532K(8380096K), 0.0337703 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] 
 CMS: abort preclean due to time 2013-03-08T14:37:03.296-0500: 553889.117: [CMS-concurrent-abortable-preclean: 1.462/5.020 secs] [Times: user=2.04 sys=0.02, real=5.02 secs] 
2013-03-08T14:37:03.299-0500: 553889.119: [GC[YG occupancy: 22614 K (76672 K)]553889.120: [Rescan (parallel) , 0.0151518 secs]553889.135: [weak refs processing, 0.0356825 secs] [1 CMS-remark: 4153644K(8303424K)] 4176259K(8380096K), 0.0620445 secs] [Times: user=0.11 sys=0.00, real=0.06 secs] 
2013-03-08T14:37:03.363-0500: 553889.183: [CMS-concurrent-sweep-start]
2013-03-08T14:37:07.248-0500: 553893.069: [GC 553893.069: [ParNew: 73047K->5136K(76672K), 0.0510894 secs] 3182253K->3115235K(8380096K), 0.0516111 secs] [Times: user=0.19 sys=0.00, real=0.05 secs] 
2013-03-08T14:37:08.277-0500: 553894.097: [CMS-concurrent-sweep: 4.856/4.914 secs] [Times: user=5.67 sys=0.02, real=4.91 secs] 
2013-03-08T14:37:08.277-0500: 553894.097: [CMS-concurrent-reset-start]
2013-03-08T14:37:08.325-0500: 553894.145: [CMS-concurrent-reset: 0.048/0.048 secs] [Times: user=0.07 sys=0.00, real=0.05 secs] 

后:

2013-03-07T17:18:01.323-0500: 180055.128: [CMS-concurrent-mark: 10.765/20.646 secs] [Times: user=50.25 sys=3.32, real=20.65 secs] 
2013-03-07T17:18:01.323-0500: 180055.128: [CMS-concurrent-preclean-start]
2013-03-07T17:18:01.401-0500: 180055.206: [CMS-concurrent-preclean: 0.076/0.078 secs] [Times: user=0.08 sys=0.00, real=0.08 secs] 
2013-03-07T17:18:01.401-0500: 180055.206: [CMS-concurrent-abortable-preclean-start]
2013-03-07T17:18:03.074-0500: 180056.879: [GC 180056.880: [ParNew: 76670K->8512K(76672K), 0.1024039 secs] 5980843K->5922977K(8380096K), 0.1028797 secs] [Times: user=0.28 sys=0.04, real=0.10 secs] 
2013-03-07T17:18:05.447-0500: 180059.253: [CMS-concurrent-abortable-preclean: 3.132/4.046 secs] [Times: user=3.94 sys=0.07, real=4.05 secs] 
2013-03-07T17:18:05.448-0500: 180059.254: [GC[YG occupancy: 51161 K (76672 K)]180059.254: [Rescan (parallel) , 0.0243232 secs]180059.279: [weak refs processing, 0.2053571 secs] [1 CMS-remark: 5914465K(8303424K)] 5965627K(8380096K), 0.2569077 secs] [Times: user=0.33 sys=0.01, real=0.26 secs] 
2013-03-07T17:18:05.706-0500: 180059.512: [CMS-concurrent-sweep-start]
2013-03-07T17:18:12.511-0500: 180066.316: [CMS-concurrent-sweep: 6.804/6.804 secs] [Times: user=13.98 sys=0.80, real=6.80 secs] 
2013-03-07T17:18:12.511-0500: 180066.316: [CMS-concurrent-reset-start]
2013-03-07T17:18:12.558-0500: 180066.363: [CMS-concurrent-reset: 0.047/0.047 secs] [Times: user=0.11 sys=0.02, real=0.05 secs] 

在这些持续约一分钟的峰值期间,Tomcat 服务器响应时间从平均 2 毫秒变为大约 90 秒。在生产 3 天后,我们回滚了更改,此后没有看到 CPU 峰值。您是否知道可能导致此行为的 JDK 或 Tomcat 中的任何更改?注意:此 Web 应用程序在堆中缓存了大量数据(启动时高达 3GB)。

以下是 JVM 设置:

(Before) Tomcat 6 / JDK 1.6.0_18:
JAVA_HOME="/usr/local/java/jdk1.6.0_18"
JAVA_OPTS="$JAVA_OPTS -server -d64 -XX:PermSize=128m -XX:MaxPermSize=128m"
CATALINA_OPTS="$CATALINA_OPTS -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/env/tomcat-instance/logs/gc.log -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=(omitted) -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

(After) Tomcat 7 / JDK 1.6.0_38:
JAVA_HOME="/usr/local/java/jdk1.6.0_38"
JAVA_OPTS="$JAVA_OPTS -server -d64 -XX:PermSize=128m -XX:MaxPermSize=128m"
CATALINA_OPTS="$CATALINA_OPTS -Xms8192m -Xmx8192m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/env/tomcat-instance/logs/gc.log -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=(omitted) -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

很感谢任何形式的帮助。

4

2 回答 2

0

轶事反馈 - 我们在 6u2n 中遇到了一个严重的内存泄漏错误,直到 7 才修复:

http://bugs.sun.com/view_bug.do?bug_id=7013538

http://bugs.sun.com/view_bug.do?bug_id=7042582

6u21 是我使用过的最安全的 Java 6 JRE。

于 2013-03-19T17:59:28.087 回答
0

您同时升级了 Tomcat 和 JVM,因此峰值可能是由其中任何一个引起的。您可以限制 GC 的线程数。

-XX:ParallelGCThreads=12

如果您运行多个 JVM,请确保 GC 线程数不超过内核数。也看看 JVM 1.7。

使用这些来查看有效的 jvm 参数并查找更改:

-XX:+UnlockDiagnosticVMOptions
-XX:+PrintFlagsFinal
-XX:+LogVMOutput
-XX:LogFile=logs/jvm.log
于 2013-03-20T06:52:56.263 回答