2

我有一个运行在 gcloud 中的 TiDB 实例,使用 tidb-ansible 脚本进行部署。我想用新的 PD 节点替换,所以我销毁并替换了那些。PD 集群现在可以正常启动了,但是当我尝试启动 TiKV 节点时,我得到了这个错误:

2018/02/28 01:42:08.091 node.rs:191: [ERROR] cluster ID mismatch: local_id 6520261967047847245 remote_id 6527407705559138241. you are trying to connect to another cluster, please reconnect to the correct PD

TiDB 常见问题解答(https://pingcap.com/docs/FAQ/)中有一个很好的错误解释:

-- 启动 TiKV 时出现集群 ID 不匹配的信息。--

这是因为本地 TiKV 中存储的集群 ID 与 PD 指定的集群 ID 不同。部署新的 PD 集群时,PD 会生成随机的集群 ID。TiKV 从 PD 获取集群 ID,并在初始化时将集群 ID 存储在本地。下次启动 TiKV 时,它会检查本地集群 ID 和 PD 中的集群 ID。如果集群 ID 不匹配,则会显示集群 ID 不匹配消息并退出 TiKV。

如果你之前部署了一个 PD 集群,然后你删除了 PD 数据并部署了一个新的 PD 集群,就会出现这个错误,因为 TiKV 使用旧数据连接到新的 PD 集群。

但是没有解释如何解决这个问题。有没有办法销毁 TiKV 实例上的本地集群 ID,以便它可以正确挂接 PD?

如果我能让他们再次交谈,PD 是否能够协调我现有的 TiKV 节点(使用现有数据)?

4

1 回答 1

2

有没有办法销毁 TiKV 实例上的本地集群 ID,以便它可以正确挂接 PD?

要让 TiKV 实例与 PD 正确挂接,您可以更改 PD 的集群 ID。请参阅以下步骤和示例。

如果我能让他们再次交谈,PD 是否能够协调我现有的 TiKV 节点(使用现有数据)?

是的,它会的。

您可以使用“ pd-recover ”来解决此问题。

  • 步骤 1. 使用集群 ID 运行新的 pd-server 6527407705559138241:。

  • 步骤 2. 将集群 ID 更改为6520261967047847245

    ./pd-recover --endpoints "http://the-new-pd-server:port" --cluster-id 6520261967047847245 --alloc-id 100000000
    
  • 步骤 3. 重启 PD 服务器。

请注意,PD 有一个 MONOTONIC 唯一 ID 分配器,即alloc-id. 所有区域 ID 和对等 ID 均由分配器生成。所以请确保您为第 2 步选择的 ID 足够大,不能超过现有 ID,否则会损坏TiKV。


例子

neil:bin/ (master) $ ./pd-server &
[1] 32718
2018/03/01 10:51:01.343 util.go:59: [info] Welcome to Placement Driver (PD).                                                                                                                                                                                                               
2018/03/01 10:51:01.343 util.go:60: [info] Release Version: 0.9.0
2018/03/01 10:51:01.343 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
2018/03/01 10:51:01.343 util.go:62: [info] Git Branch: namespace
2018/03/01 10:51:01.343 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
2018/03/01 10:51:01.343 metricutil.go:83: [info] disable Prometheus push client
2018/03/01 10:51:01.344 server.go:87: [info] PD config - Config({FlagSet:0xc420177500 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
2018/03/01 10:51:01.346 server.go:114: [info] start embed etcd
2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
2018/03/01 10:51:01.347 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
2018/03/01 10:51:01 systime_mon.go:11: [info] start system time monitor 
2018/03/01 10:51:01.408 log.go:84: [info] etcdserver: [name = pd]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [data dir = default.pd]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [member dir = default.pd/member]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [heartbeat = 500ms]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [election = 3000ms]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [snapshot count = 100000]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial advertise peer URLs = http://127.0.0.1:2380]
2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial cluster = pd=http://127.0.0.1:2380]
2018/03/01 10:51:01.475 log.go:84: [info] etcdserver: [starting member b71f75320dc06a6c in cluster 1c45a069f3a1d796]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 0]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]]
2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 1]
2018/03/01 10:51:01.587 log.go:80: [warning] auth: [simple token is not cryptographically signed]
2018/03/01 10:51:01.631 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
2018/03/01 10:51:01.632 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
2018/03/01 10:51:01.633 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 1]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 2]
2018/03/01 10:51:03.476 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 2]
2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [setting up the initial cluster version to 3.2]
2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
2018/03/01 10:51:03.477 log.go:84: [info] embed: [ready to serve client requests]
2018/03/01 10:51:03.478 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
2018/03/01 10:51:03.480 etcdutil.go:125: [warning] check etcd http://127.0.0.1:2379 status, resp: &{cluster_id:2037210783374497686 member_id:13195394291058371180 revision:1 raft_term:2  3.2.4 24576 13195394291058371180 3 2}, err: <nil>, cost: 1.84566554s
2018/03/01 10:51:03.489 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
2018/03/01 10:51:03.489 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
2018/03/01 10:51:03.500 server.go:174: [info] init cluster id 6527803384525484955
2018/03/01 10:51:03.579 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:06.578778001 +0800 CST
2018/03/01 10:51:03.579 leader.go:249: [info] PD cluster leader pd is ready to serve

neil:bin/ (master) $ ./pd-recover --endpoints "http://localhost:2379" --alloc-id 100000000 --cluster-id 66666666666
recover success! please restart the PD cluster
neil:bin/ (master) $ kill 32718
2018/03/01 10:51:35.258 server.go:228: [info] closing server
2018/03/01 10:51:35.258 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
2018/03/01 10:51:35.258 leader.go:65: [info] server is closed, return leader loop
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
2018/03/01 10:51:35.291 server.go:246: [info] close server
2018/03/01 10:51:35.291 main.go:89: [info] Got signal [15] to exit.
[1]  + 32718 done       ./pd-server
neil:bin/ (master) $ ./pd-server
2018/03/01 10:51:40.007 util.go:59: [info] Welcome to Placement Driver (PD).
2018/03/01 10:51:40.007 util.go:60: [info] Release Version: 0.9.0
2018/03/01 10:51:40.007 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
2018/03/01 10:51:40.007 util.go:62: [info] Git Branch: namespace
2018/03/01 10:51:40.007 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
2018/03/01 10:51:40.007 metricutil.go:83: [info] disable Prometheus push client
2018/03/01 10:51:40.007 server.go:87: [info] PD config - Config({FlagSet:0xc4200771a0 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
2018/03/01 10:51:40.010 server.go:114: [info] start embed etcd
2018/03/01 10:51:40 systime_mon.go:11: [info] start system time monitor 
2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
2018/03/01 10:51:40.011 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
2018/03/01 10:51:40.019 log.go:84: [info] etcdserver: [name = pd]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [data dir = default.pd]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [member dir = default.pd/member]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [heartbeat = 500ms]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [election = 3000ms]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [snapshot count = 100000]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [restarting member b71f75320dc06a6c in cluster 1c45a069f3a1d796 at commit index 20]
2018/03/01 10:51:40.020 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 2]
2018/03/01 10:51:40.020 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 2, commit: 20, applied: 0, lastindex: 20, lastterm: 2]]
2018/03/01 10:51:40.072 log.go:80: [warning] auth: [simple token is not cryptographically signed]
2018/03/01 10:51:40.113 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
2018/03/01 10:51:40.115 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
2018/03/01 10:51:40.116 etcdutil.go:62: [error] failed to get raft cluster member(s) from the given urls.
2018/03/01 10:51:40.116 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
2018/03/01 10:51:40.116 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
2018/03/01 10:51:40.116 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 2]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 3]
2018/03/01 10:51:41.021 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 3]
2018/03/01 10:51:41.039 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
2018/03/01 10:51:41.039 log.go:84: [info] embed: [ready to serve client requests]
2018/03/01 10:51:41.040 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
2018/03/01 10:51:41.066 server.go:174: [info] init cluster id 66666666666
2018/03/01 10:51:41.250 cache.go:379: [info] load 0 stores cost 465.361µs
2018/03/01 10:51:41.251 cache.go:385: [info] load 0 regions cost 426.452µs
2018/03/01 10:51:41.251 coordinator.go:123: [info] coordinator: Start collect cluster information
2018/03/01 10:51:41.251 coordinator.go:126: [info] coordinator: Cluster information is prepared
2018/03/01 10:51:41.251 coordinator.go:136: [info] coordinator: Run scheduler
2018/03/01 10:51:41.252 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:44.251760951 +0800 CST
2018/03/01 10:51:41.252 leader.go:249: [info] PD cluster leader pd is ready to serve
^C2018/03/01 10:51:56.077 server.go:228: [info] closing server
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-hot-region-scheduler stopped: context canceled
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-region-scheduler stopped: context canceled
2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-leader-scheduler stopped: context canceled
2018/03/01 10:51:56.077 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
2018/03/01 10:51:56.078 leader.go:65: [info] server is closed, return leader loop
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
2018/03/01 10:51:56.118 server.go:246: [info] close server
2018/03/01 10:51:56.118 main.go:89: [info] Got signal [2] to exit.
于 2018-03-01T03:47:35.510 回答