我有一个ReplicaSet
三个节点,192.168.254.107:27023、192.168.254.108:27023、192.168.2.69:27023。
当 192.168.254.108:27023 关闭(意外关闭)时,192.168.2.69:27023 成为主要的。
但过了一会儿,放弃初级,过渡到次级。选举和放弃不断循环执行。
ReplicaSet 是集群的一个分片。
这是日志:
403174 2016-07-20T09:41:32.054+0800 W NETWORK [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403175 2016-07-20T09:41:32.225+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403176 2016-07-20T09:41:32.226+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403177 2016-07-20T09:41:32.226+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403178 2016-07-20T09:41:34.218+0800 I REPL [ReplicationExecutor] can't see a majority of the set, relinquishing primary
403179 2016-07-20T09:41:34.218+0800 I REPL [ReplicationExecutor] Stepping down from primary in response to heartbeat
403180 2016-07-20T09:41:34.218+0800 I REPL [replExecDBWorker-1] transition to SECONDARY
403181 2016-07-20T09:41:34.222+0800 I REPL [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
403182 2016-07-20T09:41:34.226+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403183 2016-07-20T09:41:34.227+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403184 2016-07-20T09:41:34.228+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403185 2016-07-20T09:41:39.228+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403186 2016-07-20T09:41:39.229+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403187 2016-07-20T09:41:39.230+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403188 2016-07-20T09:41:42.057+0800 W NETWORK [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403189 2016-07-20T09:41:42.057+0800 W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set shard1
403190 2016-07-20T09:41:43.066+0800 I NETWORK [LockPinger] SyncClusterConnection connecting to [192.168.2.69:20001]
403191 2016-07-20T09:41:43.066+0800 I NETWORK [LockPinger] SyncClusterConnection connecting to [192.168.254.108:20001]
403192 2016-07-20T09:41:43.067+0800 I NETWORK [LockPinger] SyncClusterConnection connecting to [192.168.254.107:20001]
403193 2016-07-20T09:41:43.283+0800 I SHARDING [LockPinger] cluster 192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001 pinged successfully at 2016-07-20T09:41:43.068+0800 by distributed lock pinger '192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001/bdc9:27023:1467257950:-742828636', sleeping for 30000ms
403194 2016-07-20T09:41:44.230+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403195 2016-07-20T09:41:44.231+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403196 2016-07-20T09:41:44.232+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403197 2016-07-20T09:41:45.057+0800 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
403198 2016-07-20T09:41:45.057+0800 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
403199 2016-07-20T09:41:45.058+0800 I REPL [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection ref used
403200 2016-07-20T09:41:45.058+0800 I REPL [ReplicationExecutor] dry election run succeeded, running for election
403201 2016-07-20T09:41:45.059+0800 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 5040
403202 2016-07-20T09:41:45.059+0800 I REPL [ReplicationExecutor] transition to PRIMARY
403203 2016-07-20T09:41:45.059+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403204 2016-07-20T09:41:45.059+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403205 2016-07-20T09:41:45.060+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403206 2016-07-20T09:41:45.645+0800 I REPL [rsSync] transition to primary complete; database writes are now permitted
……
……
……
403217 2016-07-20T09:41:49.061+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403218 2016-07-20T09:41:49.062+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403219 2016-07-20T09:41:49.063+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403220 2016-07-20T09:41:51.064+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403221 2016-07-20T09:41:51.064+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403222 2016-07-20T09:41:51.065+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403223 2016-07-20T09:41:52.058+0800 W NETWORK [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403224 2016-07-20T09:41:53.065+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403225 2016-07-20T09:41:53.066+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403226 2016-07-20T09:41:53.067+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403227 2016-07-20T09:41:55.059+0800 I REPL [ReplicationExecutor] can't see a majority of the set, relinquishing primary
403228 2016-07-20T09:41:55.059+0800 I REPL [ReplicationExecutor] Stepping down from primary in response to heartbeat
403229 2016-07-20T09:41:55.059+0800 I REPL [replExecDBWorker-0] transition to SECONDARY
403230 2016-07-20T09:41:55.064+0800 I REPL [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
403231 2016-07-20T09:41:55.068+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403232 2016-07-20T09:41:55.068+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403233 2016-07-20T09:41:55.069+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403234 2016-07-20T09:42:00.070+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403235 2016-07-20T09:42:00.071+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403236 2016-07-20T09:42:00.071+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403237 2016-07-20T09:42:02.061+0800 W NETWORK [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
403238 2016-07-20T09:42:02.061+0800 W NETWORK [ReplicaSetMonitorWatcher] No primary detected for set shard1
403239 2016-07-20T09:42:05.071+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403240 2016-07-20T09:42:05.072+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403241 2016-07-20T09:42:05.073+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403242 2016-07-20T09:42:06.304+0800 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
403243 2016-07-20T09:42:06.304+0800 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
403244 2016-07-20T09:42:06.305+0800 I REPL [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection refused
403245 2016-07-20T09:42:06.305+0800 I REPL [ReplicationExecutor] dry election run succeeded, running for election
403246 2016-07-20T09:42:06.306+0800 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 5041
403247 2016-07-20T09:42:06.306+0800 I REPL [ReplicationExecutor] transition to PRIMARY
403248 2016-07-20T09:42:06.306+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403249 2016-07-20T09:42:06.307+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403250 2016-07-20T09:42:06.307+0800 I REPL [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
403251 2016-07-20T09:42:06.647+0800 I REPL [rsSync] transition to primary complete; database writes are now permitted
这是分片状态:
shard1:SECONDARY> rs.status()
{
"set" : "shard1",
"date" : ISODate("2016-07-20T03:20:49.001Z"),
"myState" : 2,
"term" : NumberLong(5326),
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "192.168.254.107:27023",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 8,
"optime" : {
"ts" : Timestamp(1468879844, 1),
"t" : NumberLong(267)
},
"optimeDate" : ISODate("2016-07-18T22:10:44Z"),
"lastHeartbeat" : ISODate("2016-07-20T03:20:45.947Z"),
"lastHeartbeatRecv" : ISODate("2016-07-05T08:38:08.083Z"),
"pingMs" : NumberLong(0),
"configVersion" : 1
},
{
"_id" : 1,
"name" : "192.168.254.108:27023",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : {
"ts" : Timestamp(0, 0),
"t" : NumberLong(-1)
},
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2016-07-20T03:20:45.952Z"),
"lastHeartbeatRecv" : ISODate("2016-07-18T22:10:46.930Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "Connection refused",
"configVersion" : -1
},
{
"_id" : 2,
"name" : "192.168.2.69:27023",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1727777,
"optime" : {
"ts" : Timestamp(1468984831, 1),
"t" : NumberLong(5326)
},
"optimeDate" : ISODate("2016-07-20T03:20:31Z"),
"infoMessage" : "could not find member to sync from",
"configVersion" : 1,
"self" : true
}
],
"ok" : 1,
"$gleStats" : {
"lastOpTime" : Timestamp(0, 0),
"electionId" : ObjectId("7fffffff00000000000014ce")
}
}
它是如何发生的?
这是 的配置ReplicaSet
,当它没问题时:
shard1:PRIMARY> rs.conf()
{
"_id" : "shard1",
"version" : 1,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 0,
"host" : "192.168.254.107:27023",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 1,
"host" : "192.168.254.108:27023",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "192.168.2.69:27023",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("577493542ff2d62af240fe4f")
}
}