Recovering from Corrupted Ceph Monitor

Modern releases of Ceph, which backs the HyperCloud storage cluster, are robust and recover from most failures on their own. In the unlikely event that a monitor (mon) cannot come up, manual action may be required. This has been observed only once in a HyperCloud cluster, where a hardware failure coincided with a very old release of Ceph.

Symptoms

  • ceph status reports HEALTH_WARN with 1 mons down.
  • The monitor log contains errors similar to the following:

    starting mon.hypercloud-storage-3 rank -1 at 10.0.0.3:6789/0 mon_data /var/lib/ceph/mon/ceph-hypercloud-storage-3 fsid 00000000-0000-0000-0000-000000000000
    2017-02-28 20:26:54.429986 7fde5e54c7c0 -1 obtain_monmap unable to find a monmap
    2017-02-28 20:26:54.430007 7fde5e54c7c0 -1 unable to obtain a monmap: (2) No such file or directory
    2017-02-28 20:26:54.436873 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 not in monmap and have been in a quorum before; must have been removed
    2017-02-28 20:26:54.436884 7fde5e54c7c0 -1 mon.hypercloud-storage-3@-1(probing) e0 commit suicide!
    2017-02-28 20:26:54.436886 7fde5e54c7c0 -1 failed to initialize
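
The log check above can be scripted. The following is a minimal sketch, not a HyperCloud tool: the function name has_monmap_failure is hypothetical, and the log path in the usage comment is an assumption that may differ on your nodes. It only looks for two of the messages shown in the sample log.

```shell
#!/bin/sh
# Sketch: detect the corrupted-monitor signature in a ceph-mon log.
# has_monmap_failure is a hypothetical helper name, not part of Ceph.

# Reads log text on stdin; returns 0 if it contains either of the
# monmap failure messages shown in the sample log, non-zero otherwise.
has_monmap_failure() {
    grep -Eq 'unable to obtain a monmap|not in monmap and have been in a quorum before'
}

# Usage (the log path is an assumption; adjust it for your node):
#   has_monmap_failure < /var/log/ceph/ceph-mon.hypercloud-storage-3.log \
#       && echo "monmap failure found"
```

A match only indicates that the symptom is present; confirm with ceph status before starting the recovery.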
    

Recovery

  1. Verify that at least one other monitor is up by running the ceph status command.
  2. SSH to the storage node that hosts the failed monitor daemon. The node can be identified from the details after mon: in the ceph status output. The name will be in the format of hypercloud-storage-
  3. Look for the ceph-mon and ceph-run processes:

    root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# ps -efww | grep -i ceph-mon
    root        3879       1  0 Sep12 ?        00:00:02 /bin/sh /bin/ceph-run /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
    ceph     2903519    3879  8 16:44 ?        00:00:00 /bin/ceph-mon -i hypercloud-storage-2 --pid-file /var/run/ceph/mon.hypercloud-storage-2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -f
    root     2903571 2902870  0 16:44 pts/1    00:00:00 grep -i ceph-mon
    
  4. End these two processes:

    root@hypercloud-storage-2:/boot/static-node/storage/ceph/monitor/ceph-hypercloud-storage-2# kill -9 3879 ; kill -9 2903519
    
  5. Change to the Ceph monitor data directory and rename the failed monitor's directory as a backup, so that the system can recreate it:

    root@hypercloud-storage-2:~# cd /var/lib/ceph/mon/
    root@hypercloud-storage-2:/var/lib/ceph/mon# ls
    ceph-hypercloud-storage-2/
    root@hypercloud-storage-2:/var/lib/ceph/mon# mv ceph-hypercloud-storage-2 ceph-hypercloud-storage-2.bak
    
  6. Remove the monitor to be replaced from the cluster monitor map:

    Danger

    Run this command only on the failed monitor's node and for that monitor only. Do not remove any other monitor.

    root@hypercloud-storage-2:/var/lib/ceph/mon# ceph mon remove hypercloud-storage-2
    
  7. Reboot that storage node:

    root@hypercloud-storage-2:~# reboot
    
  8. Check the status with ceph -s.

  9. Once the monitor is back online, re-enable messenger protocol version 2 on the newly recreated monitor:

    root@hypercloud-storage-2:/# ceph mon enable-msgr2
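
The steps above can be sketched as two small shell helpers. This is an illustration under stated assumptions rather than a supported script: the function names mon_pids and recovery_plan are hypothetical, mon_pids assumes the ps -efww column layout shown in step 3, and recovery_plan only prints the commands from steps 5 to 9 without running them, because the last two are run after the reboot.

```shell
#!/bin/sh
# Sketch only: mon_pids and recovery_plan are hypothetical helper names.

# Reads `ps -efww` output on stdin and prints the PIDs (column 2) of the
# ceph-run wrapper and ceph-mon daemon for the given monitor id.
# The grep process itself is skipped because its command line does not
# contain "/bin/ceph-mon -i <id> ".
mon_pids() {
    awk -v id="$1" 'index($0, "/bin/ceph-mon -i " id " ") { print $2 }'
}

# Prints, without executing, the commands from steps 5 to 9 for the
# given monitor id so they can be reviewed before being run by hand.
recovery_plan() {
    mon_id="$1"
    printf '%s\n' \
        "mv /var/lib/ceph/mon/ceph-${mon_id} /var/lib/ceph/mon/ceph-${mon_id}.bak" \
        "ceph mon remove ${mon_id}" \
        "reboot" \
        "ceph -s" \
        "ceph mon enable-msgr2"
}

# Usage on the failed monitor's node:
#   ps -efww | mon_pids hypercloud-storage-2
#   recovery_plan hypercloud-storage-2
```

Printing the plan instead of executing it keeps the destructive steps (kill, mv, ceph mon remove, reboot) under manual control, which matches the warning in step 6.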