Ceph got stuck after the disks filled up. The disk-full condition has since been fixed, but the CephFS MDS daemons have been stuck in the rejoin state for a long time.
ceph -s
Truncated output:
  cluster:
    id:     (deleted)
    health: HEALTH_WARN
            1 filesystem is degraded

  services:
    mon: 6 daemons, deleted
    mgr: deleted(active, since 3h), standbys:
    mds: fs:2/2{fs:0=mds1=up:rejoin,fs:1=mds2=up:rejoin} 1 up:standby
    osd: 9 osds: 9 up (since 3h), 9 in (since 6w)

  data:
    pools:   10 pools, 849 pgs
    objects: deleted
    usage:   deleted
    pgs:     849 active+clean
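In case more detail is useful, these are additional checks I can run and paste here on request (standard ceph CLI, nothing beyond the stock commands is assumed):

ceph fs status       # per-rank MDS states and standbys
ceph health detail   # full text behind the HEALTH_WARN
ceph fs dump         # raw FSMap, including the rank-to-daemon assignment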
I checked the mds1 log, and it reports: mds.0.cache failed to open ino 0x101 err -116/0
Can anyone help me fix the MDS and get the filesystem healthy again?
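As far as I understand, err -116 is ESTALE, and inode 0x101 looks like one of the per-rank internal directories (0x100 + rank), i.e. rank 1's MDS directory, which mds.0 apparently cannot open during rejoin. If it helps, I could check whether its backing object is still present in the metadata pool, roughly like this (the pool name cephfs_metadata and the object name 101.00000000 are my assumptions about the dirfrag naming convention, please correct me if that is wrong):

ceph osd pool ls                                    # confirm the real metadata pool name
rados -p cephfs_metadata stat 101.00000000          # does the dirfrag object for ino 0x101 exist?
rados -p cephfs_metadata listomapkeys 101.00000000  # list the dentries stored in it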
Ceph version:
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)
The full MDS log is here:
2020-11-11T11:59:53.940+0800 7f1bfaad0300 0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mds, pid 1437936
2020-11-11T11:59:53.940+0800 7f1bfaad0300 1 main not setting numa affinity
2020-11-11T11:59:53.940+0800 7f1bfaad0300 0 pidfile_write: ignore empty --pid-file
2020-11-11T11:59:53.948+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250302 from mon.2
2020-11-11T11:59:54.952+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250303 from mon.2
2020-11-11T11:59:54.952+0800 7f1be9df7700 1 mds.mds1 Monitors have assigned me to become a standby.
2020-11-11T11:59:54.961+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250304 from mon.2
2020-11-11T11:59:54.961+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map i am now mds.0.250304
2020-11-11T11:59:54.961+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map state change up:boot --> up:replay
2020-11-11T11:59:54.961+0800 7f1be9df7700 1 mds.0.250304 replay_start
2020-11-11T11:59:54.961+0800 7f1be9df7700 1 mds.0.250304 recovery set is 1
2020-11-11T11:59:54.962+0800 7f1be9df7700 1 mds.0.250304 waiting for osdmap 8067 (which blacklists prior instance)
2020-11-11T11:59:54.965+0800 7f1be35ea700 -1 mds.0.openfiles _load_finish got (2) No such file or directory
2020-11-11T11:59:54.969+0800 7f1be2de9700 0 mds.0.cache creating system inode with ino:0x100
2020-11-11T11:59:54.969+0800 7f1be2de9700 0 mds.0.cache creating system inode with ino:0x1
2020-11-11T11:59:59.340+0800 7f1be1de7700 1 mds.0.250304 Finished replaying journal
2020-11-11T11:59:59.340+0800 7f1be1de7700 1 mds.0.250304 making mds journal writeable
2020-11-11T12:00:00.018+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250305 from mon.2
2020-11-11T12:00:00.018+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map i am now mds.0.250304
2020-11-11T12:00:00.019+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map state change up:replay --> up:resolve
2020-11-11T12:00:00.019+0800 7f1be9df7700 1 mds.0.250304 resolve_start
2020-11-11T12:00:00.019+0800 7f1be9df7700 1 mds.0.250304 reopen_log
2020-11-11T12:00:40.991+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250307 from mon.2
2020-11-11T12:00:40.991+0800 7f1be9df7700 1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
2020-11-11T12:00:46.078+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250308 from mon.2
2020-11-11T12:00:46.078+0800 7f1be9df7700 1 mds.0.250304 recovery set is 1
2020-11-11T12:00:46.279+0800 7f1be9df7700 1 mds.0.250304 resolve_done
2020-11-11T12:00:47.098+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250309 from mon.2
2020-11-11T12:00:47.098+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map i am now mds.0.250304
2020-11-11T12:00:47.098+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map state change up:resolve --> up:reconnect
2020-11-11T12:00:47.098+0800 7f1be9df7700 1 mds.0.250304 reconnect_start
2020-11-11T12:00:47.098+0800 7f1be9df7700 1 mds.0.server reconnect_clients -- 20 sessions
2020-11-11T12:00:47.098+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.421063 v1:192.168.60.121:0/3417198623 after 0
2020-11-11T12:00:47.098+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754486 v1:192.168.60.112:0/2544559814 after 0
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754513 v1:192.168.60.105:0/1293692070 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.1225207 v1:192.168.60.91:0/3148420742 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.1225703 v1:192.168.60.170:0/1268068775 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754480 v1:192.168.60.102:0/2002454818 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.1225690 v1:192.168.60.90:0/2591854104 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754449 v1:192.168.60.109:0/1906666522 after 0.00100002
2020-11-11T12:00:47.099+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.421610 v1:192.168.60.122:0/3403538656 after 0.00100002
2020-11-11T12:00:47.100+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.396098 v1:192.168.60.10:0/1483765764 after 0.00200004
2020-11-11T12:00:47.100+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.394564 v1:192.168.60.123:0/3786388104 after 0.00200004
2020-11-11T12:00:47.100+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.424769 v1:192.168.60.120:0/10753295 after 0.00200004
2020-11-11T12:00:47.102+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.396441 v1:192.168.60.101:0/3362363763 after 0.00400008
2020-11-11T12:00:47.104+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754541 v1:192.168.60.106:0/2279833643 after 0.00600011
2020-11-11T12:00:47.105+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754519 v1:192.168.60.111:0/2462281130 after 0.00700013
2020-11-11T12:00:47.106+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754535 v1:192.168.60.110:0/3350031855 after 0.00800015
2020-11-11T12:00:47.106+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754467 v1:192.168.60.100:0/3784129623 after 0.00800015
2020-11-11T12:00:47.107+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754461 v1:192.168.60.103:0/1624035805 after 0.00900017
2020-11-11T12:00:47.108+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754473 v1:192.168.60.108:0/1815689189 after 0.0100002
2020-11-11T12:00:47.108+0800 7f1be9df7700 0 log_channel(cluster) log [DBG] : reconnect by client.754580 v1:192.168.60.104:0/681341054 after 0.0100002
2020-11-11T12:00:47.109+0800 7f1be9df7700 1 mds.0.250304 reconnect_done
2020-11-11T12:00:48.097+0800 7f1be9df7700 1 mds.mds1 Updating MDS map to version 250310 from mon.2
2020-11-11T12:00:48.097+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map i am now mds.0.250304
2020-11-11T12:00:48.097+0800 7f1be9df7700 1 mds.0.250304 handle_mds_map state change up:reconnect --> up:rejoin
2020-11-11T12:00:48.097+0800 7f1be9df7700 1 mds.0.250304 rejoin_start
2020-11-11T12:00:48.103+0800 7f1be9df7700 1 mds.0.250304 rejoin_joint_start
2020-11-11T12:00:48.110+0800 7f1be35ea700 0 mds.0.cache failed to open ino 0x101 err -116/0
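The log stops at that line and both ranks stay in up:rejoin. If more verbose output would help, I can raise the MDS debug level and watch the daemon while it is stuck, along these lines (mds1 is the daemon name; 20 is just a guess at a useful verbosity, and ceph daemon has to be run on the MDS host):

ceph config set mds.mds1 debug_mds 20   # temporarily raise MDS log verbosity
ceph daemon mds.mds1 status             # state as reported by the daemon itself
ceph config rm mds.mds1 debug_mds       # drop back to the default level afterwards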
Looking forward to your help. Thanks!