1

目前在 10 节点集群上遇到问题,运行大约一天后,3 个节点将退出(总是随机的 3 个)。

Riak 版本:2.1.4

10 个虚拟机,每个运行 10GB 内存,运行 Oracle Linux 7.3 版

Java版本:

[riak@pp2xria01trd001 riak$] java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

我们平常的 Riak 家伙此刻正在度假,所以没有太多资源可调查。任何关于从哪里开始寻找的帮助或指导将不胜感激。

故障转储详细信息:

Slogan: Kernel pid terminated (application_controller) ({application_terminated,yokozuna,shutdown})System version: Erlang 
R16B02_basho10 (erts-5.10.3) [source] [64-bit] [smp:2:2] [async-threads:64] [hipe] [kernel-poll:true] [frame-pointer]

solr.log 中没有太多详细说明原因:

2017-04-06 21:04:13,958 [INFO] <qtp1924582348-828>@LogUpdateProcessorFactory.java:198 [marketblueprints_index] webapp=/internal_solr path=/update params={} {} 0 0
2017-04-06 21:04:18,567 [INFO] <qtp1924582348-855>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:23,573 [INFO] <qtp1924582348-1161>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:28,578 [INFO] <qtp1924582348-865>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:33,584 [INFO] <qtp1924582348-848>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:38,589 [INFO] <qtp1924582348-641>@SolrDispatchFilter.java:732 [admin] webapp=null path=/admin/cores params={action=STATUS&wt=json} status=0 QTime=2
2017-04-06 21:04:54,242 [INFO] <Thread-1>@Monitor.java:41 Yokozuna has exited - shutting down Solr
2017-04-06 21:04:55,219 [INFO] <Thread-2>@Server.java:320 Graceful shutdown SocketConnector@0.0.0.0:8093
2017-04-06 21:04:56,027 [INFO] <Thread-2>@Server.java:329 Graceful shutdown o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/
riak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:04:59,288 [INFO] <Thread-2>@CoreContainer.java:314 Shutting down CoreContainer instance=1916575798
2017-04-06 21:04:59,710 [INFO] <Thread-2>@SolrCore.java:1040 [feed_mapping_index]  CLOSING SolrCore org.apache.solr.core.SolrCore@78acc5b

但是,在 solr.log 中的一些合并过程之后,我们得到以下结果(我怀疑这会阻止主管第二次重新启动它,从而停止 Riak

2017-04-06 21:05:13,546 [INFO] <Thread-2>@CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data
2017-04-06 21:05:13,547 [INFO] <Thread-2>@CachingDirectoryFactory.java:236 looking to close /var/lib/riak/yz/endpoint_mappings_index/data/index [CachedDir<<refCount=0;path=
/var/lib/riak/yz/endpoint_mappings_index/data/index;done=false>>]
2017-04-06 21:05:13,547 [INFO] <Thread-2>@CachingDirectoryFactory.java:305 Closing directory: /var/lib/riak/yz/endpoint_mappings_index/data/index
2017-04-06 21:05:14,657 [INFO] <Thread-2>@ContextHandler.java:832 stopped o.e.j.w.WebAppContext{/internal_solr,file:/var/lib/riak/yz_temp/solr-webapp/webapp/},/usr/lib64/ri
ak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/webapps/solr.war
2017-04-06 21:05:15,298 [WARN] <Thread-2>@QueuedThreadPool.java:145 79 threads could not be stopped

Erlang.log 包含:

2017-04-06 21:04:54.193 [error] <0.5934.108> gen_server yz_solr_proc terminated with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:
8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false
},[]}]}}
2017-04-06 21:04:54.198 [error] <0.5934.108> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection
,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=json","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,h
ostname},100,1,{[],false},[]}]}} in gen_server:terminate/6 line 744
2017-04-06 21:04:54.201 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.5934.108> exit with reason {timeout,{gen_server,call,[<0.1306.0>,{spawn_connection,{url,"http://localhost:8093/internal_solr/admin/cores?action=STATUS&wt=j
son","localhost",8093,undefined,undefined,"/internal_solr/admin/cores?action=STATUS&wt=json",http,hostname},100,1,{[],false},[]}]}} in context child_terminated
2017-04-06 21:04:57.422 [info] <0.1102.0>@riak_ensemble_peer:leading:631 {{kv,1141798154164767904846628775559596109106197299200,3,114179815416476790484662877555959610910619
7299200},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.422 [info] <0.1090.0>@riak_ensemble_peer:leading:631 {{kv,685078892498860742907977265335757665463718379520,3,6850788924988607429079772653357576654637183
79520},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:04:57.780 [info] <0.1072.0>@riak_ensemble_peer:leading:631 {{kv,0,3,0},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:01.432 [info] <0.8030.232>@yz_solr_proc:init:119 Starting solr: "/usr/bin/riak/java" ["-Djava.awt.headless=true","-Djetty.home=/usr/lib64/riak/lib/yokozuna
-2.1.7-0-g6cf80ad/priv/solr","-Djetty.temp=/var/lib/riak/yz_temp","-Djetty.port=8093","-Dsolr.solr.home=/var/lib/riak/yz","-DhostContext=/internal_solr","-cp","/usr/lib64/r
iak/lib/yokozuna-2.1.7-0-g6cf80ad/priv/solr/start.jar","-Dlog4j.configuration=file:///etc/riak/solr-log4j.properties","-Dyz.lib.dir=/usr/lib64/riak/lib/yokozuna-2.1.7-0-g6c
f80ad/priv/java_lib","-d64","-Xms4g","-Xmx4g","-XX:+UseStringCache","-XX:+UseCompressedOops","-Dcom.sun.management.jmxremote.port=8985","-Dcom.sun.management.jmxremote.auth
enticate=false","-Dcom.sun.management.jmxremote.ssl=false","org.eclipse.jetty.start.Main"]
2017-04-06 21:05:01.483 [info] <0.1108.0>@riak_ensemble_peer:leading:631 {{kv,1370157784997721485815954530671515330927436759040,3,137015778499772148581595453067151533092743
6759040},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:02.032 [info] <0.8030.232>@yz_solr_proc:handle_info:184 solr stdout/err: OpenJDK 64-Bit Server VM warning: ignoring option UseSplitVerifier; support was re
moved in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option UseStringCache; support was removed in 8.0

2017-04-06 21:05:04.212 [info] <0.1110.0>@riak_ensemble_peer:leading:631 {{kv,1415829711164312202009819681693899175291684651008,3,0},'riak@pp2xria01trd001.pp2.williamhill.p
lc'}: Leading
2017-04-06 21:05:10.798 [info] <0.1096.0>@riak_ensemble_peer:leading:631 {{kv,913438523331814323877303020447676887284957839360,3,9134385233318143238773030204476768872849578
39360},'riak@pp2xria01trd001.pp2.williamhill.plc'}: Leading
2017-04-06 21:05:17.001 [info] <0.8030.232>@yz_solr_proc:handle_info:184 solr stdout/err: Error: Exception thrown by the agent : java.rmi.server.ExportException: Port alrea
dy in use: 8985; nested exception is:
        java.net.BindException: Address already in use (Bind failed)

2017-04-06 21:05:17.964 [error] <0.8030.232> gen_server yz_solr_proc terminated with reason: {"solr OS process exited",1}
2017-04-06 21:05:17.964 [error] <0.8030.232> CRASH REPORT Process yz_solr_proc with 0 neighbours exited with reason: {"solr OS process exited",1} in gen_server:terminate/6 
line 744
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.8030.232> exit with reason {"solr OS process exited",1} in context child_terminated
2017-04-06 21:05:17.964 [error] <0.1150.0> Supervisor yz_solr_sup had child yz_solr_proc started with yz_solr_proc:start_link("/var/lib/riak/yz", "/var/lib/riak/yz_temp", 8
093, 8985) at <0.8030.232> exit with reason reached_max_restart_intensity in context shutdown
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason shutdown in context 
child_terminated
2017-04-06 21:05:17.964 [error] <0.1119.0> Supervisor yz_sup had child yz_solr_sup started with yz_solr_sup:start_link() at <0.1150.0> exit with reason reached_max_restart_
intensity in context shutdown
2017-04-06 21:05:23.072 [error] <0.1551.0> Supervisor yz_index_hashtree_sup had child ignored started with yz_index_hashtree:start_link() at undefined exit with reason kill
ed in context shutdown_error
2017-04-06 21:05:24.353 [info] <0.745.0>@yz_app:prep_stop:74 Stopping application yokozuna.
2017-04-06 21:05:27.582 [error] <0.745.0>@yz_app:prep_stop:82 Stopping application yokozuna - exit:{noproc,{gen_server,call,[yz_solrq_drain_mgr,{drain,[]},infinity]}}.
2017-04-06 21:05:27.582 [info] <0.745.0>@yz_app:stop:88 Stopped application yokozuna.
2017-04-06 21:05:27.940 [info] <0.7.0> Application yokozuna exited with reason: shutdown
2017-04-06 21:05:28.165 [info] <0.431.0>@riak_kv_app:prep_stop:228 Stopping application riak_kv - marked service down.
2017-04-06 21:05:28.252 [info] <0.431.0>@riak_kv_app:prep_stop:232 Unregistered pb services
2017-04-06 21:05:28.408 [info] <0.431.0>@riak_kv_app:prep_stop:237 unregistered webmachine routes
2017-04-06 21:05:28.459 [info] <0.431.0>@riak_kv_app:prep_stop:239 all active put FSMs completed
2017-04-06 21:05:29.665 [info] <0.540.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.540.0>)
2017-04-06 21:05:29.665 [info] <0.539.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_hook) host stopping (<0.539.0>)
2017-04-06 21:05:30.379 [info] <0.532.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.532.0>)
2017-04-06 21:05:31.116 [info] <0.534.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.534.0>)
2017-04-06 21:05:31.362 [info] <0.533.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.533.0>)
2017-04-06 21:05:32.153 [info] <0.536.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.536.0>)
2017-04-06 21:05:32.245 [info] <0.537.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.537.0>)
2017-04-06 21:05:32.676 [info] <0.535.0>@riak_kv_js_vm:terminate:237 Spidermonkey VM (pool: riak_kv_js_reduce) host stopping (<0.535.0>)
2017-04-06 21:05:33.450 [info] <0.431.0>@riak_kv_app:stop:250 Stopped  application riak_kv.
2017-04-06 21:05:41.701 [info] <0.195.0>@riak_core_app:stop:116 Stopped  application riak_core.
2017-04-06 21:05:43.061 [info] <0.93.0> alarm_handler: {clear,system_memory_high_watermark}

我们在 riak.conf 中添加了额外的选项

search = on
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms4g -Xmx4g -XX:+UseStringCache -XX:+UseCompressedOops
search.solr.port = 8093
search.solr.start_timeout = 180s

没有任何 OOM 错误或进程被 oom_killer 杀死的迹象

4

0 回答 0