
I have a replica set with three members: 192.168.254.107:27023, 192.168.254.108:27023, and 192.168.2.69:27023.

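When 192.168.254.108:27023 goes down (an unexpected shutdown), 192.168.2.69:27023 becomes primary.
But after a short while it relinquishes the primary role and transitions to secondary. This cycle of winning an election and then stepping down repeats over and over.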

The replica set is one shard of a sharded cluster.

Here is the log:

2016-07-20T09:41:32.054+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
2016-07-20T09:41:32.225+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:32.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:32.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:34.218+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2016-07-20T09:41:34.218+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2016-07-20T09:41:34.218+0800 I REPL     [replExecDBWorker-1] transition to SECONDARY
2016-07-20T09:41:34.222+0800 I REPL     [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
2016-07-20T09:41:34.226+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:34.227+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:34.228+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:39.228+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:39.229+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:39.230+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:42.057+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
2016-07-20T09:41:42.057+0800 W NETWORK  [ReplicaSetMonitorWatcher] No primary detected for set shard1
2016-07-20T09:41:43.066+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.2.69:20001]
2016-07-20T09:41:43.066+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.254.108:20001]
2016-07-20T09:41:43.067+0800 I NETWORK  [LockPinger] SyncClusterConnection connecting to [192.168.254.107:20001]
2016-07-20T09:41:43.283+0800 I SHARDING [LockPinger] cluster 192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001 pinged successfully at 2016-07-20T09:41:43.068+0800 by distributed lock pinger '192.168.2.69:20001,192.168.254.108:20001,192.168.254.107:20001/bdc9:27023:1467257950:-742828636', sleeping for 30000ms
2016-07-20T09:41:44.230+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:44.231+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:44.232+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:45.057+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-07-20T09:41:45.057+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-07-20T09:41:45.058+0800 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection refused
2016-07-20T09:41:45.058+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 5040
2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:45.060+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:45.645+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
……
……
……
2016-07-20T09:41:49.061+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:49.062+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:49.063+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:51.064+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:51.064+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:51.065+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:52.058+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
2016-07-20T09:41:53.065+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:53.066+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:53.067+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:55.059+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2016-07-20T09:41:55.059+0800 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2016-07-20T09:41:55.059+0800 I REPL     [replExecDBWorker-0] transition to SECONDARY
2016-07-20T09:41:55.064+0800 I REPL     [ReplicationExecutor] Member 192.168.254.107:27023 is now in state SECONDARY
2016-07-20T09:41:55.068+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:55.068+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:41:55.069+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:00.070+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:00.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:00.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:02.061+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 192.168.254.108:27023, reason: errno:111 Connection refused
2016-07-20T09:42:02.061+0800 W NETWORK  [ReplicaSetMonitorWatcher] No primary detected for set shard1
2016-07-20T09:42:05.071+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:05.072+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:05.073+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:06.304+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-07-20T09:42:06.304+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-07-20T09:42:06.305+0800 I REPL     [ReplicationExecutor] VoteRequester: Got failed response from 192.168.254.108:27023: HostUnreachable: Connection refused
2016-07-20T09:42:06.305+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 5041
2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2016-07-20T09:42:06.306+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:06.307+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:06.307+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 192.168.254.108:27023; HostUnreachable: Connection refused
2016-07-20T09:42:06.647+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
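
To put a number on the flapping, the log above can be scanned for the "transition to PRIMARY" and "relinquishing primary" messages. The following is only a rough sketch: `primaryStints` is a hypothetical helper name, not a MongoDB API, and it assumes the timestamp format shown in this log. For the stint that starts at 09:41:45.059 and ends at 09:41:55.059 it reports 10000 ms, which happens to equal the configured `electionTimeoutMillis`; the log alone does not show whether that is more than a coincidence.

```javascript
// Hypothetical helper (not a MongoDB API): measures how long each PRIMARY
// stint lasted, using only the text of a mongod log.
function primaryStints(logText) {
  var stamp = /(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})([+-]\d{2})(\d{2})/;
  var stints = [];
  var becamePrimaryAt = null;
  logText.split("\n").forEach(function (line) {
    var m = line.match(stamp);
    if (!m) return; // skip lines without a timestamp, e.g. the elided "……"
    // Rewrite "+0800" as "+08:00" so Date.parse sees a standard ISO 8601 offset.
    var when = Date.parse(m[1] + m[2] + ":" + m[3]);
    if (line.indexOf("transition to PRIMARY") !== -1) {
      becamePrimaryAt = when;
    } else if (line.indexOf("relinquishing primary") !== -1 && becamePrimaryAt !== null) {
      stints.push(when - becamePrimaryAt); // duration in milliseconds
      becamePrimaryAt = null;
    }
  });
  return stints;
}

// Two lines taken from the log above.
var sampleLog = [
  "2016-07-20T09:41:45.059+0800 I REPL     [ReplicationExecutor] transition to PRIMARY",
  "2016-07-20T09:41:55.059+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary"
].join("\n");
console.log(primaryStints(sampleLog)); // one stint of 10000 ms
```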

Here is the status of the shard:

shard1:SECONDARY> rs.status()
{
    "set" : "shard1",
    "date" : ISODate("2016-07-20T03:20:49.001Z"),
    "myState" : 2,
    "term" : NumberLong(5326),
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [
        {
            "_id" : 0,
            "name" : "192.168.254.107:27023",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 8,
            "optime" : {
                "ts" : Timestamp(1468879844, 1),
                "t" : NumberLong(267)
            },
            "optimeDate" : ISODate("2016-07-18T22:10:44Z"),
            "lastHeartbeat" : ISODate("2016-07-20T03:20:45.947Z"),
            "lastHeartbeatRecv" : ISODate("2016-07-05T08:38:08.083Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 1
        },
        {
            "_id" : 1,
            "name" : "192.168.254.108:27023",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : {
                "ts" : Timestamp(0, 0),
                "t" : NumberLong(-1)
            },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2016-07-20T03:20:45.952Z"),
            "lastHeartbeatRecv" : ISODate("2016-07-18T22:10:46.930Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "Connection refused",
            "configVersion" : -1
        },
        {
            "_id" : 2,
            "name" : "192.168.2.69:27023",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1727777,
            "optime" : {
                "ts" : Timestamp(1468984831, 1),
                "t" : NumberLong(5326)
            },
            "optimeDate" : ISODate("2016-07-20T03:20:31Z"),
            "infoMessage" : "could not find member to sync from",
            "configVersion" : 1,
            "self" : true
        }
    ],
    "ok" : 1,
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff00000000000014ce")
    }
}
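
One detail in this output that may be worth checking: for 192.168.254.107:27023 the `lastHeartbeat` field is only a few seconds old, while `lastHeartbeatRecv` dates from 2016-07-05, about two weeks before `date`. The sketch below lists members that are reported healthy but have not been heard from recently. `membersNotHeardFrom` and the 30-second threshold are assumptions made for illustration, and the output by itself does not prove that this gap causes the step-downs.

```javascript
// Hypothetical helper; `status` is a document shaped like the rs.status()
// output above, with date fields as JavaScript Date objects.
function membersNotHeardFrom(status, maxAgeMs) {
  return status.members
    .filter(function (m) {
      // Skip ourselves and members already reported as unhealthy.
      return !m.self && m.health === 1 && m.lastHeartbeatRecv;
    })
    .filter(function (m) {
      // Subtracting two Date objects yields the gap in milliseconds.
      return status.date - m.lastHeartbeatRecv > maxAgeMs;
    })
    .map(function (m) { return m.name; });
}

// Trimmed copy of the status document above.
var sampleStatus = {
  date: new Date("2016-07-20T03:20:49.001Z"),
  members: [
    { name: "192.168.254.107:27023", health: 1,
      lastHeartbeatRecv: new Date("2016-07-05T08:38:08.083Z") },
    { name: "192.168.254.108:27023", health: 0,
      lastHeartbeatRecv: new Date("2016-07-18T22:10:46.930Z") },
    { name: "192.168.2.69:27023", health: 1, self: true }
  ]
};
console.log(membersNotHeardFrom(sampleStatus, 30000));
```

In the mongo shell the same function could be called as `membersNotHeardFrom(rs.status(), 30000)`.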

How does this happen?

Here is the replica set configuration, captured while it was healthy:

shard1:PRIMARY> rs.conf()
{
    "_id" : "shard1",
    "version" : 1,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "192.168.254.107:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "192.168.254.108:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "192.168.2.69:27023",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {

            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "getLastErrorModes" : {

        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("577493542ff2d62af240fe4f")
    }
}
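
For reference, the number of votes a primary must be able to see can be derived from this configuration. The sketch below uses a hypothetical helper name, `votingMajority`, and assumes the usual rule of more than half of the configured votes.

```javascript
// Hypothetical helper; `conf` is a document shaped like the rs.conf() output above.
function votingMajority(conf) {
  var totalVotes = conf.members.reduce(function (sum, m) {
    return sum + (m.votes || 0);
  }, 0);
  // A strict majority: more than half of all configured votes.
  return Math.floor(totalVotes / 2) + 1;
}

// Trimmed copy of the configuration above: three members, one vote each.
var sampleConf = {
  members: [
    { host: "192.168.254.107:27023", votes: 1 },
    { host: "192.168.254.108:27023", votes: 1 },
    { host: "192.168.2.69:27023", votes: 1 }
  ]
};
console.log(votingMajority(sampleConf)); // 2
```

With 192.168.254.108:27023 down, two voting members remain, which equals the majority of 2. On paper the set should therefore be able to keep a primary, provided the two surviving members can exchange heartbeats in both directions.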