Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
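I have a Storm topology with 3 bolts (A, B, C), where the middle bolt takes around 450 ms on average and the other two take less than 1 ms.
I am able to run the topology with the following parallelism hint values: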
A: 4
B: 700
C: 10
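To give a rough idea of why B needs so much more parallelism than A or C, here is a minimal sketch of the throughput ceiling these numbers imply. It assumes each executor processes one tuple at a time and spends the full average latency on it; the class and method names are only illustrative and are not part of Storm.

```java
public class BoltThroughput {
    // Upper bound on tuples per second for a bolt, assuming every executor
    // handles tuples one after another and each tuple takes avgLatencyMs.
    public static double maxTuplesPerSecond(int parallelism, double avgLatencyMs) {
        return parallelism * 1000.0 / avgLatencyMs;
    }

    public static void main(String[] args) {
        // Values from this question: B runs with parallelism 700 at ~450 ms per tuple.
        System.out.printf("B: %.0f tuples/s%n", maxTuplesPerSecond(700, 450.0));
    }
}
```

Under those assumptions B tops out at roughly 1,556 tuples per second with a parallelism hint of 700, so the only way to push more through it is to raise the hint.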
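But when I increase the parallelism hint of B to 1200, the topology does not start.
In the topology logs, I see the executors for B being loaded multiple times, as follows: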
2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]
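But in between, the worker process gets restarted. I don't see any errors in the topology logs or the Storm logs. Following are the Storm logs from when the worker gets restarted: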
2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0
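This keeps happening and the topology never restarts, whereas it used to start perfectly when the parallelism hint of Bolt B was 700, with no other changes.
I see one interesting log here, but I am not yet sure what it means:
Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143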
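As a side note on that exit code: on Linux, a process terminated by a signal is conventionally reported with exit code 128 plus the signal number, so 143 would correspond to signal 15 (SIGTERM). That only says the worker was asked to terminate, not why. A minimal sketch of the decoding, with a hypothetical helper name that is not a Storm API:

```java
public class ExitCodeDecoder {
    // Returns the signal number if the exit code follows the 128 + N
    // convention, or -1 if it does not look like a signal exit.
    public static int signalFromExitCode(int exitCode) {
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        // 143 is the code from the supervisor log above; signal 15 is SIGTERM on Linux.
        System.out.println("exit code 143 -> signal " + signalFromExitCode(143));
    }
}
```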
Any suggestions?
Edit:
Config:
topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g
Edit:
Logs of
strace -fp PID -e trace=read,write,network,signal,ipc
are in the gist.
I can't fully make sense of them yet; here are some lines that look relevant:
[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 3362] kill(1487, SIGTERM) = 0
[pid 3362] close(1)