Recovering from a Corrupted Ceph Monitor
Warning
The procedure in the following tutorial should only be carried out in debug mode under explicit direction of the SoftIron support team.
Modern releases of Ceph, the storage backing the HyperCloud storage cluster, are very robust and support self-recovery. However, in the unlikely event that a monitor (mon) is unable to come up, you may need to take manual action. This has occurred only once in a HyperCloud cluster, and the cause was a hardware failure coupled with a very old release of Ceph.
Symptoms
- Ceph will report HEALTH_WARN in ceph status with 1 mons down.
- Logging will have errors similar to the below:

    starting mon.hypercloud-storage-3 rank -1 at 10.0.0.3:6789/0 mon_data /var/lib/ceph/mon/ceph-hypercloud-storage-3 fsid 00000000-0000-0000-0000-000000000000
    2017-02-28 20:26:54.429986 7fde5e54c7c0 -1 obtain_monmap unable to find a monmap
    2017-02-28 20:26:54.430007 7fde5e54c7c0 -1 unable to obtain a monmap: (2) No such file or directory
    2017-02-28 20:26:54.436873 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 not in monmap and have been in a quorum before; must have been removed
    2017-02-28 20:26:54.436884 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 commit suicide!
    2017-02-28 20:26:54.436886 7fde5e54c7c0 -1 failed to initialize
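A minimal sketch for checking a monitor log for the errors listed above. The helper name mon_log_has_monmap_error is hypothetical, and the log path in the usage comment is an assumption based on Ceph's usual default; the location on a HyperCloud node may differ.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given log file contains any of the
# monmap errors shown in the sample log, non-zero otherwise.
mon_log_has_monmap_error() {
    log_file="$1"
    grep -E -q 'unable to (find|obtain) a monmap|not in monmap' "$log_file"
}

# Example usage (the log path is an assumption, not confirmed for HyperCloud):
# mon_log_has_monmap_error /var/log/ceph/ceph-mon.hypercloud-storage-3.log \
#     && echo "monmap errors found"
```

The function only reads the file, so it is safe to run on any node before deciding whether to contact support.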
Recovery
- Verify at least one other monitor is in status up by running the ceph status command.
- SSH to the failed monitor daemon's storage node, which can be determined by looking at the details after mon: in the ceph status output. The name will be in the format of hypercloud-storage-
- Look for the ceph-mon and ceph-run processes:

    root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# ps -efww | grep -i ceph-mon
    root 3879 1 0 Sep12 ? 00:00:02 /bin/sh /bin/ceph-run /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
    ceph 2903519 3879 8 16:44 ? 00:00:00 /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
    root 2903571 2902870 0 16:44 pts/1 00:00:00 grep -i ceph-mon

- End these two processes:
- Navigate to the Ceph monitor library and move this directory to a backup to allow the system to recreate the setup:
- Remove the monitor to be replaced from the cluster monitor map:
Danger
Make sure that you only run this command on that node and no others.
- Reboot that storage node.
- Check the status with ceph -s.
- Once the monitor is back online, re-enable messenger protocol version 2 on the newly recreated monitor:
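As a rough dry-run sketch of the recovery steps, the script below only prints commands for review and executes nothing. Everything in it is an assumption to be confirmed with SoftIron support before use: the function name recovery_plan is hypothetical, the PIDs and monitor name come from the example ps output, the monitor directory follows the mon_data path shown in the sample log (the shell prompt in the ps example suggests a HyperCloud node may keep it elsewhere), and ceph mon remove and ceph mon enable-msgr2 are standard Ceph CLI commands that may not be the exact ones support directs.

```shell
#!/bin/sh
# Dry-run only: prints the commands, one per line, instead of running them.
# Usage: recovery_plan <mon-id> <pid> [<pid> ...]
recovery_plan() {
    mon_id="$1"
    shift
    # Assumed monitor data directory, taken from mon_data in the sample log.
    mon_dir="/var/lib/ceph/mon/ceph-${mon_id}"

    echo "kill $*"                        # end the ceph-run and ceph-mon processes
    echo "mv ${mon_dir} ${mon_dir}.bak"   # back up the monitor directory
    echo "ceph mon remove ${mon_id}"      # drop the monitor from the monitor map
    echo "reboot"                         # reboot the affected storage node
    echo "ceph -s"                        # check cluster status afterwards
    echo "ceph mon enable-msgr2"          # re-enable messenger protocol version 2
}

# Values below are placeholders from the example ps output.
recovery_plan hypercloud-storage-2 3879 2903519
```

Printing rather than executing keeps the sketch harmless on a live cluster; each printed line can be compared against the instructions from support before anything is run by hand.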