Datastore Resilience
Motivation
Resilience is vital for data storage. Loss of a single component should never cause data loss. Loss of a single component should not remove our ability to determine if data has been corrupted.
Resilience in Neutron
Neutron provides resilience using a 3-way replica by default. This means that data written to Neutron always exists in three copies, enabling us to easily decide if one of those copies has become corrupt on-disk and repair it. Detecting corruption this way is vital to prevent data degradation and bit-rot in storage.
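As a rough illustration of why three copies are enough to single out a bad one, the sketch below compares checksums of three replicas and lets the two that agree outvote the third. It is not a description of Neutron's internal mechanism: the function name `find_corrupt_replicas`, the use of SHA-256, and the sample data are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def find_corrupt_replicas(replicas: list[bytes]) -> tuple[bytes, list[int]]:
    """Return the majority content and the indexes of replicas that disagree with it."""
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    majority_digest, votes = Counter(digests).most_common(1)[0]
    # Without a strict majority there is no way to tell which copy is good.
    if votes <= len(replicas) // 2:
        raise ValueError("no majority: cannot decide which copy is correct")
    good = replicas[digests.index(majority_digest)]
    bad = [i for i, d in enumerate(digests) if d != majority_digest]
    return good, bad


# Hypothetical data: the third copy has suffered a single-byte corruption.
copies = [b"block-data", b"block-data", b"block-dbta"]
good, bad = find_corrupt_replicas(copies)

# Repair: overwrite every disagreeing copy with the majority content.
for i in bad:
    copies[i] = good
```

With only two copies, a mismatch would show that something is wrong but not which copy to trust; the third copy is what breaks the tie.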
Neutron’s default datastore is created as a 3-way replica so that smaller clusters with only 3 storage servers can still provide resilience and the ability to detect and repair corrupt data.
A 3-way replica is computationally cheap and provides good performance to VMs. However, because every piece of data is stored three times, it limits usable capacity to one third (33.33%) of the raw capacity of the underlying devices.
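The 33.33% figure follows directly from storing every byte three times. A minimal sketch of that arithmetic, where the helper name `replica_usable_capacity` and the 300 TB raw capacity are illustrative assumptions only:

```python
def replica_usable_capacity(raw_tb: float, copies: int = 3) -> float:
    """Usable capacity when every object is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_tb / copies


raw_tb = 300.0  # hypothetical raw capacity across all storage devices
usable_tb = replica_usable_capacity(raw_tb)
efficiency = usable_tb / raw_tb

print(f"{usable_tb:.2f} TB usable, {efficiency:.2%} efficiency")
# prints "100.00 TB usable, 33.33% efficiency"
```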
Erasure coding
Erasure coding improves the storage efficiency and enables VMs to use a higher percentage of the underlying capacity. The caveat is that a larger number of hosts are required to ensure that resiliency is available. Erasure coding splits data into a number of chunks k and uses these to generate additional coded chunks m. Each of the coded m chunks can effectively recover any of the k chunks if one is lost.
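The idea of rebuilding a lost chunk from a coded chunk can be shown with the simplest possible erasure code: k data chunks plus a single XOR parity chunk (m = 1). This is a teaching sketch, not the coding scheme VM Squared uses; production erasure codes with m greater than 1 typically rely on Reed-Solomon-style arithmetic rather than plain XOR. All function names and the sample data are assumptions made for this example.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_into_chunks(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size chunks, zero-padding the tail if needed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def encode_parity(chunks: list[bytes]) -> bytes:
    """Generate one coded chunk as the XOR of all data chunks."""
    return reduce(xor_bytes, chunks)


def recover_chunk(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data chunk from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


data = b"erasure coding demo!"
chunks = split_into_chunks(data, k=4)
parity = encode_parity(chunks)

# Simulate losing one data chunk, then rebuild it from what is left.
lost = 2
surviving = chunks[:lost] + chunks[lost + 1:]
rebuilt = recover_chunk(surviving, parity)
```

Because XOR is its own inverse, combining the parity with every surviving chunk cancels them out and leaves exactly the missing chunk.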
VM Squared supports the following Erasure Coding profiles
When creating additional datastores you can elect to use 3-way replica or one of the above Erasure Coding schemes.
SoftIron recommends k + m + 1 storage nodes, which allows recovery in the case where a whole storage node goes offline.
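To make the sizing concrete, the sketch below computes the storage efficiency k / (k + m) and the k + m + 1 recommendation for a given pair of values, reading k + m + 1 as a count of storage nodes. The helper name `ec_summary` and the k = 4, m = 2 values are hypothetical and are not taken from the list of supported profiles.

```python
def ec_summary(k: int, m: int) -> dict[str, float]:
    """Storage efficiency and recommended node count for k data + m coded chunks."""
    if k < 1 or m < 1:
        raise ValueError("k and m must both be at least 1")
    return {
        "storage_efficiency": k / (k + m),  # fraction of raw capacity that is usable
        "recommended_nodes": k + m + 1,     # one spare node beyond the k + m chunks
    }


# Hypothetical profile values, chosen only to illustrate the arithmetic.
summary = ec_summary(k=4, m=2)
print(f"efficiency: {summary['storage_efficiency']:.1%}")  # prints "efficiency: 66.7%"
print(f"recommended nodes: {summary['recommended_nodes']}")  # prints "recommended nodes: 7"
```

Compared with the 33.33% efficiency of a 3-way replica, this hypothetical profile roughly doubles usable capacity, at the cost of needing seven nodes rather than three.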
Setting m to a value of 2 will create a system with minimum robustness and increase the risk of inaccessibility.