Recovering from a failed data disk
Hard disks and solid state disks are components that are likely to fail after a period of use. Ceph keeps multiple copies of all data in the cluster and can automatically recover from one or more disk failures, provided enough disks remain across the nodes to adequately redistribute the data; however, the total capacity of the pool is reduced as disks fail. This document describes how to recover from a failed data disk.
Symptoms
Ceph will report `HEALTH_WARN` or `HEALTH_ERR` in `ceph status`, with `osds down` reported.
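For illustration, cluster health can be checked from any node with Ceph admin access. The output below is representative only, not the exact text a HyperCloud cluster will print.

```
# Check overall cluster health; the health section flags down OSDs.
ceph status

# Representative excerpt of the output:
#   health: HEALTH_WARN
#           1 osds down
```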
Info
If enough data disks remain to redistribute the data, Ceph may transition back into a `HEALTH_OK` state once redistribution completes, but `ceph status` will still report `osds down`.
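To see which specific OSDs are down, one option is to filter the OSD tree. This is a generic Ceph technique; the grep pattern below is only an assumption about the output layout.

```
# Show only OSD tree entries whose status column reads "down"
ceph osd tree | grep -i ' down '
```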
Danger
There may be data loss in a `HEALTH_ERR` state! A ticket MUST be created with the HyperCloud Support Team at support@softiron.com to look into this issue before proceeding.
The Linux kernel may report I/O errors for the failed disk, which can be seen in the output of `dmesg` or in `/var/log/all`.
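The exact kernel messages vary by driver, but a search along these lines will usually surface them; the grep patterns are illustrative assumptions rather than a fixed log format.

```
# Look for disk I/O errors in the kernel ring buffer and the system log
dmesg | grep -i 'i/o error'
grep -ri 'i/o error' /var/log/all
```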
Recovery
- SSH into the storage node with the failed disk.
- Suspend `ceph-automountd` so that it does not get in the way, by running `mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend`.
- For each OSD that has failed, run `ceph-decom-osd -failedDisk` (see the consolidated sketch after this list). The OSD ID can be determined by running `ceph osd tree`.
- The disk should be physically removed and optionally replaced.
- Run `rm /var/run/ceph-automount/suspend` to allow `ceph-automountd` to re-ingest the disk.
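A consolidated sketch of the whole procedure, run over SSH on the affected storage node, is shown below. It only restates the documented commands; the exact arguments `ceph-decom-osd` expects for a given OSD ID are not shown here and should be taken from the tool's own usage.

```
# 1. Suspend ceph-automountd so it does not re-mount the disk mid-procedure
mkdir -p /var/run/ceph-automount && touch /var/run/ceph-automount/suspend

# 2. Identify the failed OSD(s): look for entries marked "down"
ceph osd tree

# 3. Decommission each failed OSD (documented invocation; run once per failed OSD)
ceph-decom-osd -failedDisk

# 4. After the disk has been physically removed (and optionally replaced),
#    resume ceph-automountd so the replacement disk is re-ingested
rm /var/run/ceph-automount/suspend
```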