Recovering from a Corrupted Ceph Monitor

Warning

The procedure in this tutorial should only be carried out in debug mode, under the explicit direction of the SoftIron support team.

Modern releases of Ceph, the storage software backing the HyperCloud storage cluster, are very robust and support self-recovery. However, in the unlikely event that a monitor (mon) is unable to come up, you may need to take manual action. Only one instance of this has occurred in a HyperCloud cluster, and it was caused by a hardware failure combined with a very old release of Ceph.

Symptoms

Ceph will report HEALTH_WARN in ceph status with 1 mons down.

The failed monitor's log will contain errors similar to the following:

starting mon.hypercloud-storage-3 rank -1 at 10.0.0.3:6789/0 mon_data /var/lib/ceph/mon/ceph-hypercloud-storage-3 fsid 00000000-0000-0000-0000-000000000000
2017-02-28 20:26:54.429986 7fde5e54c7c0 -1 obtain_monmap unable to find a monmap
2017-02-28 20:26:54.430007 7fde5e54c7c0 -1 unable to obtain a monmap: (2) No such file or directory
2017-02-28 20:26:54.436873 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 not in monmap and have been in a quorum before; must have been removed
2017-02-28 20:26:54.436884 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 commit suicide!
2017-02-28 20:26:54.436886 7fde5e54c7c0 -1 failed to initialize
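
As a rough sketch, a monitor log can be scanned for the monmap errors shown above with a small shell function. The function name `mon_log_has_monmap_errors` is hypothetical, and the log path in the usage comment assumes Ceph's default log location, which may differ on your nodes.

```shell
#!/bin/sh
# Hypothetical helper: reads monitor log text on stdin and exits 0 if it
# contains any of the monmap-related errors shown in the sample log above.
mon_log_has_monmap_errors() {
    grep -E -q 'unable to find a monmap|unable to obtain a monmap|not in monmap'
}

# Usage (the log path is an assumption; adjust it to your node):
#   mon_log_has_monmap_errors < /var/log/ceph/ceph-mon.hypercloud-storage-3.log \
#       && echo "monmap errors found"
```

A match only suggests that the monitor store is damaged. Confirm with the SoftIron support team before starting the recovery.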

Recovery

Verify that at least one other monitor is up by running the ceph status command.
SSH to the failed monitor daemon's storage node, which can be determined by looking at the details after mon: in the ceph status output. The name will be in the format of hypercloud-storage-
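
The quorum check can also be scripted. The sketch below assumes that the JSON printed by `ceph quorum_status --format json` contains a `quorum_names` array, and it uses a hypothetical helper name, `quorum_members`. It parses with standard text tools only, so treat it as a convenience rather than a robust JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph quorum_status --format json` output on
# stdin and prints the names of the monitors in quorum, one per line.
quorum_members() {
    tr -d ' \n\t' \
        | grep -o '"quorum_names":\[[^]]*]' \
        | sed -e 's/^"quorum_names":\[//' -e 's/]$//' \
        | tr ',' '\n' \
        | tr -d '"' \
        | grep -v '^$'
}

# Usage on a node with a working ceph CLI:
#   ceph quorum_status --format json | quorum_members
```

If the failed monitor is the only name missing from the output, the remaining monitors still hold quorum and the recovery can proceed.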

Look for the ceph-mon and ceph-run processes:

root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# ps -efww | grep -i ceph-mon
root        3879       1  0 Sep12 ?        00:00:02 /bin/sh /bin/ceph-run /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
ceph     2903519    3879  8 16:44 ?        00:00:00 /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
root     2903571 2902870  0 16:44 pts/1    00:00:00 grep -i ceph-mon

End these two processes:

root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# kill -9 3879 ; kill -9 2903519
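
To avoid copying PIDs by hand, the process list can be filtered with awk. This is a minimal sketch that assumes the `ps -ef` column layout shown above (PID in the second column, command starting in the eighth). The helper name `ceph_mon_pids` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef` style output on stdin and prints the
# PID of every line mentioning ceph-mon, skipping the grep command itself.
ceph_mon_pids() {
    awk '/ceph-mon/ && $8 != "grep" { print $2 }'
}

# Usage: list the PIDs first, and end the processes only after checking them.
#   ps -efww | ceph_mon_pids
```

The function only prints PIDs. Ending the processes remains a deliberate manual step.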

Navigate to the Ceph monitor data directory and rename the failed monitor's directory as a backup, so that the system can recreate it:

root@hypercloud-storage-2:~# cd /var/lib/ceph/mon/
root@hypercloud-storage-2:/var/lib/ceph/mon# ls
ceph-hypercloud-storage-2/
root@hypercloud-storage-2:/var/lib/ceph/mon# mv ceph-hypercloud-storage-2 ceph-hypercloud-storage-2.bak
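
If a `.bak` directory already exists from an earlier attempt, a plain `mv` would move the monitor directory inside it rather than rename it. The sketch below is a variation on the step above, not the documented procedure. It appends a timestamp to the backup name, and the helper name `backup_mon_dir` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: renames <mon_root>/ceph-<node> to a timestamped backup
# and prints the new path. Fails if the source directory does not exist.
backup_mon_dir() {
    mon_root=$1
    node=$2
    src="$mon_root/ceph-$node"
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$src" ]; then
        echo "no such monitor directory: $src" >&2
        return 1
    fi
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage:
#   backup_mon_dir /var/lib/ceph/mon hypercloud-storage-2
```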

Remove the monitor to be replaced from the cluster monitor map:

Danger

Make sure that you run this command only on the failed monitor's node, and on no other node.
```
root@hypercloud-storage-2:/var/lib/ceph/mon# ceph mon remove hypercloud-storage-2
```
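
Before rebooting, it can be worth confirming that the monitor no longer appears in the monitor map. The sketch below assumes that each monitor line printed by `ceph mon dump` ends with `mon.<name>`. The helper name `monmap_lists_mon` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph mon dump` output on stdin and exits 0 if
# a monitor with exactly the given name is listed, 1 otherwise.
monmap_lists_mon() {
    awk -v name="mon.$1" '$NF == name { found = 1 } END { exit !found }'
}

# Usage:
#   ceph mon dump | monmap_lists_mon hypercloud-storage-2 \
#       || echo "hypercloud-storage-2 is no longer in the monmap"
```

Comparing the whole last field avoids a false match between names that share a prefix, such as hypercloud-storage-2 and hypercloud-storage-20.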
Reboot that storage node:
```
root@hypercloud-storage-2:~# reboot
```
Check the status with ceph -s.
Once the monitor is back online, re-enable messenger protocol version 2 on the newly recreated monitor:
```
root@hypercloud-storage-2:/# ceph mon enable-msgr2
```
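
To confirm that the recreated monitor is advertising a messenger v2 address, its entry in the monitor map can be inspected. This sketch assumes that `ceph mon dump` prints each monitor's addresses with a `v2:` prefix once msgr2 is enabled and ends the line with `mon.<name>`. The helper name `mon_has_msgr2` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph mon dump` output on stdin and exits 0 if
# the named monitor's line includes a v2 (msgr2) address, 1 otherwise.
mon_has_msgr2() {
    awk -v name="mon.$1" '$NF == name && /v2:/ { ok = 1 } END { exit !ok }'
}

# Usage:
#   ceph mon dump | mon_has_msgr2 hypercloud-storage-2 && echo "msgr2 enabled"
```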