Recovering from Corrupted Ceph Monitor
Modern releases of Ceph, the storage technology backing the HyperCloud storage cluster, are very robust and support self-recovery. However, in the unlikely event that a monitor (mon) is unable to come up, you may need to take manual action. This has happened only once in a HyperCloud cluster, and the circumstances were a hardware failure coupled with a very old release of Ceph.
Symptoms
- Ceph will report HEALTH_WARN in `ceph status` with `1 mons down`.
- Logging will have errors similar to the below:

```
starting mon.hypercloud-storage-3 rank -1 at 10.0.0.3:6789/0 mon_data /var/lib/ceph/mon/ceph-hypercloud-storage-3 fsid 00000000-0000-0000-0000-000000000000
2017-02-28 20:26:54.429986 7fde5e54c7c0 -1 obtain_monmap unable to find a monmap
2017-02-28 20:26:54.430007 7fde5e54c7c0 -1 unable to obtain a monmap: (2) No such file or directory
2017-02-28 20:26:54.436873 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 not in monmap and have been in a quorum before; must have been removed
2017-02-28 20:26:54.436884 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 commit suicide!
2017-02-28 20:26:54.436886 7fde5e54c7c0 -1 failed to initialize
```
Recovery
- Verify at least one other monitor is in status `up` by running the `ceph status` command.
- SSH to the failed monitor daemon's storage node, which can be determined by looking at the details after `mon:` in the `ceph status` output. The node name will be in the format `hypercloud-storage-<N>`.
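  For example, the following sketch pulls the monitor section out of the status output and connects to the node; the hostname used here is illustrative, substitute the failed node reported by your cluster:

```
# Show the monitor section of the cluster status; the failed monitor is
# the one missing from the quorum list (e.g. reported alongside "1 mons down").
ceph status | grep -A 3 'mon:'

# Connect to the storage node that hosts the failed monitor daemon.
ssh root@hypercloud-storage-3
```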
- Look for the `ceph-mon` and `ceph-run` processes:

```
root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# ps -efww | grep -i ceph-mon
root     3879        1  0 Sep12 ?      00:00:02 /bin/sh /bin/ceph-run /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
ceph     2903519  3879  8 16:44 ?      00:00:00 /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
root     2903571 2902870  0 16:44 pts/1 00:00:00 grep -i ceph-mon
```
- End these two processes:
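  The original command block for this step is not preserved here; a minimal sketch, using the PIDs from the `ps` output above (the `ceph-run` wrapper and the `ceph-mon` daemon it spawned), would be:

```
# Kill the ceph-run wrapper first so it does not respawn the monitor,
# then the ceph-mon daemon itself (PIDs taken from the ps output above).
kill 3879 2903519

# Confirm both processes have exited before continuing.
ps -efww | grep -i ceph-mon
```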
- Navigate to the Ceph monitor data directory and move it to a backup location to allow the system to recreate the monitor store:
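  The exact path is not preserved in this article. The log output above shows the mon data directory as `/var/lib/ceph/mon/ceph-<hostname>` (the shell prompt in the `ps` listing suggests it may also be reachable under `/boot/static-node/storage/ceph/monitor/` on HyperCloud nodes). A sketch, assuming `hypercloud-storage-2` is the failed node:

```
cd /var/lib/ceph/mon

# Move the corrupted monitor store aside so it can be recreated; the .bak
# suffix is illustrative and can be any name not used by Ceph.
mv ceph-hypercloud-storage-2 ceph-hypercloud-storage-2.bak
```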
- Remove the monitor to be replaced from the cluster monitor map:
Danger
Make sure that you only run this command on that node and no others.
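  The original command block is not preserved in this article. If the step derived the monitor name from the local hostname (which would explain the warning above), a sketch would look like the following; the use of `hostname -s` is an assumption:

```
# Remove this node's monitor from the cluster monitor map.
# Assumption: the monitor name matches the node's short hostname, which is
# why this would only be safe to run on the failed node itself.
ceph mon remove "$(hostname -s)"

# Verify the removed monitor no longer appears in the monitor map.
ceph mon dump
```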
- Reboot that storage node.
- Check the status with `ceph -s`.
- Once the monitor is back online, re-enable messenger protocol version 2 on the newly recreated monitor:
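  The command for this step is not preserved here; a sketch using the standard Ceph tooling, assuming the rest of the cluster already runs with msgr2 enabled, would be:

```
# Re-enable messenger protocol version 2; the recreated monitor will start
# listening on the v2 port (3300) in addition to the legacy v1 port (6789).
ceph mon enable-msgr2

# Confirm every monitor now advertises both v2 and v1 addresses.
ceph mon dump
```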