etcd - etcd snapshot restore + DNS discovery problem

Question

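I'm trying to restore a 5-node etcd cluster on Amazon ECS (which uses DNS discovery) from a snapshot, but each node starts up as a single-node cluster and the nodes never add one another as members.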

The startup script inside the etcd Docker container is:

THIS_IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
THIS_NAME=$(curl http://169.254.169.254/latest/meta-data/hostname | cut -d . -f 1)
aws s3 cp s3://test_bucket/snapshot.db .
if [ -f "./snapshot.db" ]; then
    echo "restoring from db...."
    # No --name / --initial-cluster given, so etcdctl's single-member defaults are used.
    ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir "${THIS_NAME}.etcd"
fi
# Note: --advertise-client-urls uses the peer port 2380; clients are served on 2379.
etcd --data-dir="${THIS_NAME}.etcd" --name "${THIS_NAME}" \
    --discovery-srv "${DISCOVERY_SRV}" \
    --initial-advertise-peer-urls "http://${THIS_IP}:2380" \
    --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls "http://${THIS_IP}:2380" \
    --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state "${CLUSTER_STATE}" \
    --initial-cluster-token "${TOKEN}"

The node name (THIS_NAME) is derived from the container's IP address, e.g. ip-10-0-6-22, and the private IP address (THIS_IP) is retrieved from the Amazon EC2 instance metadata service.
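Because every member name follows the ip-A-B-C-D pattern taken from the EC2 hostname, the complete member list can in principle be computed from the private IPs alone. That is relevant for restores, since etcdctl snapshot restore expects a static --initial-cluster list rather than a DNS SRV domain. A minimal sketch, assuming the five private IPs are already known and that each hostname really mirrors its IP (build_initial_cluster is a hypothetical helper; port 2380 matches the script above):

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a list of private IPs into a static
# --initial-cluster value, reusing the ip-A-B-C-D naming that the startup
# script derives from the EC2 hostname. Assumes peers listen on port 2380.
build_initial_cluster() {
  local ip
  local -a entries=()
  for ip in "$@"; do
    # 10.6.0.44 -> ip-10-6-0-44=http://10.6.0.44:2380
    entries+=("ip-${ip//./-}=http://${ip}:2380")
  done
  # Join the entries with commas, the separator --initial-cluster expects.
  local IFS=,
  printf '%s\n' "${entries[*]}"
}
```

For example, build_initial_cluster 10.6.0.44 10.6.0.45 prints ip-10-6-0-44=http://10.6.0.44:2380,ip-10-6-0-45=http://10.6.0.45:2380 (the second IP is only an illustration). Every member would need to be given the identical string.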

The logs look like this (newest entries first):

2021-04-16 10:22:07.689108 W | etcdserver: read-only range request "key:\"/runtime/corev3sit/ocbc/@shared/counterparties/saxo/last_activities_synced_time\" " with result "error:auth: invalid auth token" took too long (1m59.99874666s) to execute
2021-04-16 10:15:10.691908 N | etcdserver/membership: set the initial cluster version to 3.2
2021-04-16 10:15:10.691954 I | etcdserver/api: enabled capabilities for version 3.2
2021-04-16 10:15:10.690268 I | etcdserver: setting up the initial cluster version to 3.2
2021-04-16 10:15:10.690348 I | etcdserver: published {Name:ip-10-6-0-44 ClientURLs:[http://10.6.0.44:2380]} to cluster cdf818194e3a8c32
2021-04-16 10:15:10.690581 I | embed: ready to serve client requests
2021-04-16 10:15:10.691082 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2021-04-16 10:15:10.689378 I | raft: 8e9e05c52164694d is starting a new election at term 1
2021-04-16 10:15:10.689459 I | raft: 8e9e05c52164694d became candidate at term 2
2021-04-16 10:15:10.689480 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2
2021-04-16 10:15:10.689512 I | raft: 8e9e05c52164694d became leader at term 2
2021-04-16 10:15:10.689523 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
2021-04-16 10:15:09.812773 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2021-04-16 10:15:09.805550 I | etcdserver: starting server... [version: 3.2.26, cluster version: to_be_decided]
2021-04-16 10:15:09.801880 W | auth: simple token is not cryptographically signed
2021-04-16 10:15:09.788452 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 1
2021-04-16 10:15:09.788569 I | raft: 8e9e05c52164694d became follower at term 1
2021-04-16 10:15:09.788668 I | raft: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 1, commit: 1, applied: 1, lastindex: 1, lastterm: 1]
2021-04-16 10:15:09.788899 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2021-04-16 10:15:09.787344 I | etcdserver: name = ip-10-6-0-44
2021-04-16 10:15:09.787530 I | etcdserver: data dir = ip-10-6-0-44.etcd
2021-04-16 10:15:09.787634 I | etcdserver: member dir = ip-10-6-0-44.etcd/member
2021-04-16 10:15:09.787722 I | etcdserver: heartbeat = 100ms
2021-04-16 10:15:09.787780 I | etcdserver: election = 1000ms
2021-04-16 10:15:09.787858 I | etcdserver: snapshot count = 100000
2021-04-16 10:15:09.787951 I | etcdserver: advertise client URLs = http://10.6.0.44:2380
2021-04-16 10:15:09.769610 I | etcdserver: recovered store from snapshot at index 1
2021-04-16 10:15:09.768017 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2021-04-16 10:15:09.768251 I | embed: listening for peers on http://0.0.0.0:2380
2021-04-16 10:15:09.768367 I | embed: listening for client requests on 0.0.0.0:2379
2021-04-16 10:15:09.767325 I | etcdmain: etcd Version: 3.2.26
2021-04-16 10:15:09.767583 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2021-04-16 10:15:09.767670 I | etcdmain: Go Version: go1.11.6
2021-04-16 10:15:09.767783 I | etcdmain: Go OS/Arch: linux/amd64
2021-04-16 10:15:09.767844 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2021-04-16 10:15:09.745143 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
2021-04-16 15:45:09 restoring from db....
2021-04-16 15:45:09 Completed 256.0 KiB/668.0 KiB (3.5 MiB/s) with 1 file(s) remaining Completed 512.0 KiB/668.0 KiB (6.5 MiB/s) with 1 file(s) remaining Completed 668.0 KiB/668.0 KiB (8.4 MiB/s) with 1 file(s) remaining download: s3://etcd/snapshot.db to ./snapshot.db
2021-04-16 15:45:09 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 44 100 44 0 0 11000 0 --:--:-- --:--:-- --:--:-- 11000
2021-04-16 15:45:09 % Total % Received % Xferd Average Speed Time Time Time Current
2021-04-16 15:45:09 Dload Upload Total Spent Left Speed
2021-04-16 15:45:08 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 100 9 100 9 0 0 3000 0 --:--:-- --:--:-- --:--:-- 3000
2021-04-16 15:45:08 % Total % Received % Xferd Average Speed Time Time Time Current
2021-04-16 15:45:08 Dload Upload Total Spent Left Speed

Can anyone help me with this?

Accepted Answer

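Sharing this for anyone else who runs into the same problem. It looks like discovery is supported only during the initial cluster bootstrap, and adding or removing any node requires the cluster to maintain a minimum quorum of members.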
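The log in the question is consistent with this: the restored member was registered with peer URL http://localhost:2380, which is what etcdctl snapshot restore writes when --name, --initial-cluster and --initial-advertise-peer-urls are left at their defaults. etcd then reports "already initialized as member before", so --discovery-srv is never consulted. For reference, etcd's disaster-recovery procedure restores the same snapshot on every member with explicit membership flags instead. A minimal sketch (restore_member is a hypothetical helper; the initial-cluster string and token must be identical on all members):

```bash
#!/usr/bin/env bash
# Hypothetical helper: restore ONE member of a multi-member cluster from a
# shared snapshot. Run it on every member with that member's own name and
# peer URL, but with the same initial-cluster string and token everywhere.
restore_member() {
  local snapshot=$1         # e.g. snapshot.db downloaded from S3
  local name=$2             # e.g. ip-10-6-0-44 (must appear in initial_cluster)
  local peer_url=$3         # e.g. http://10.6.0.44:2380
  local initial_cluster=$4  # comma-separated name=peer-url pairs for all members
  local token=$5            # new cluster token shared by all members
  ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" \
    --name "$name" \
    --initial-cluster "$initial_cluster" \
    --initial-cluster-token "$token" \
    --initial-advertise-peer-urls "$peer_url" \
    --data-dir "${name}.etcd"
}
```

On each node this might be called as restore_member snapshot.db "$THIS_NAME" "http://${THIS_IP}:2380" "$INITIAL_CLUSTER" "$TOKEN", where INITIAL_CLUSTER is a hypothetical variable holding the full five-member list. etcd would then be started with the same --name and --data-dir; because the data directory already exists, the --initial-cluster* and --discovery-srv flags are ignored at that point.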

That left me with only one option: I created a single-node etcd cluster from the backup and used etcdctl make-mirror to copy its data into the cluster.
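A sketch of that workaround, assuming the temporary single-node cluster restored from the backup serves clients at one URL and a member of the receiving cluster at another (mirror_into and both URLs are hypothetical). etcdctl make-mirror reads from the cluster given by --endpoints and writes to the destination passed as its argument:

```bash
#!/usr/bin/env bash
# Hypothetical helper for the make-mirror workaround.
#   src:  client URL of the single-node cluster restored from the backup
#   dest: client URL of one member of the cluster that should receive the data
mirror_into() {
  local src=$1 dest=$2
  # Copies the existing keys from src to dest, then keeps watching src and
  # replaying new changes onto dest until the process is interrupted (Ctrl-C).
  ETCDCTL_API=3 etcdctl --endpoints "$src" make-mirror "$dest"
}
```

For example, mirror_into http://127.0.0.1:2379 http://10.6.0.45:2379 would mirror from a local single-node cluster to one member of the new cluster. Note that the destination should be a client URL (port 2379 in the script above), not a peer URL.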

Not the best way to restore a backup, but at least I didn't lose any data.
