
Hardware Management

Managing the hardware and software layers of the host machines that underpin a production cloud is critical to get right. It is complex, with many moving parts, and one small hiccup can cause cascading failures. This is a challenging problem that needs to be treated with respect.

Standardization

While individual machines may differ in specifications or peripherals, HyperCloud treats all hardware the same way. Everything from management networking to a multi-socket CPU/GPU compute node has the same BMC platform, firmware, operating system, storage and virtualization layers.

This enables a 'base OS layer to rule them all' approach, in which a single base OS runs natively everywhere. Bare-metal provisioning, stateless install, upgradability, maintainability and supportability all become manageable problems, because once you have solved them for one node, you have solved them for every node. For example, upgrading a node becomes as simple as rebooting.
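The "upgrade by reboot" idea can be illustrated with a minimal sketch. It assumes that a node keeps no OS state locally and simply boots whatever image is currently published to it. The class names `ImageServer` and `StatelessNode` and the version strings are hypothetical, chosen for illustration only; they are not HyperCloud APIs.

```python
class ImageServer:
    """Hypothetical stand-in for whatever publishes the base OS image."""

    def __init__(self, version: str) -> None:
        self.current_version = version

    def publish(self, version: str) -> None:
        # Publishing a new image does not touch any running node.
        self.current_version = version


class StatelessNode:
    """Hypothetical node that holds no OS state between boots."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running_version = None  # nothing is installed locally

    def reboot(self, server: ImageServer) -> None:
        # Every boot starts from the currently published image,
        # so a reboot is also an upgrade (or a rollback).
        self.running_version = server.current_version


server = ImageServer("1.0")
fleet = [StatelessNode(f"node-{i}") for i in range(3)]
for node in fleet:
    node.reboot(server)

server.publish("1.1")      # new base OS image becomes available
fleet[0].reboot(server)    # only the rebooted node picks it up
```

In this sketch the same mechanism covers rollback: publishing the previous version and rebooting returns a node to it, because nothing was ever installed on the node itself.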

Standard OS Layer

Statelessness

The lessons the industry learned from deploying resilient applications in production in the DevOps world have been slow to trickle down to hardware management. The state of the art in bare-metal provisioning is still very much wrestling with vendor nuances, and with the ugliness of the binary blobs and proprietary firmware found in the more 'embedded' parts of current server hardware.

Stateless Architecture

By standardizing on a single platform and maintaining our own out-of-band hardware and software, we can predictably manage our hardware as a fleet. This means we can provision, commission, and decommission hardware on the fly, without having to worry about state on the server itself. As a stand-alone concept this is not novel, but it is still atypical at this layer of the stack.

Stateless bring-up is achieved with a set of stateful control plane nodes, which keep the cluster up and look after the bring-up of all remaining stateless nodes within the cluster. In HyperCloud, the control plane nodes are the interconnect nodes in each rack.

After the autonomous boot and initial health checks, a new system looks for control plane nodes to boot from. When found, the system conducts pre-boot checks of hardware and memory, pulls its image from the control plane, and begins to boot into Linux. Once booted, the correct network configuration is discovered, and the node is identified either as a compute or storage resource. It is then brought up as such and enlisted in the cloud as a new resource.
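The bring-up sequence described above can be sketched as a short, linear procedure. This is an illustrative model only, not HyperCloud's implementation: the `ControlPlane` and `Node` classes, the `bring_up` function, and the rule that a node with storage disks becomes a storage resource are all assumptions made for the example, and network discovery is reduced to a comment.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Role(Enum):
    COMPUTE = auto()
    STORAGE = auto()


@dataclass
class ControlPlane:
    """Hypothetical stateful control plane node that serves OS images."""
    name: str
    image_version: str
    enlisted: list = field(default_factory=list)

    def fetch_image(self) -> str:
        return f"os-image-{self.image_version}"

    def enlist(self, node_id: str, role: Role) -> None:
        self.enlisted.append((node_id, role))


@dataclass
class Node:
    """Hypothetical stateless node; the fields are illustrative."""
    node_id: str
    has_storage_disks: bool  # assumed signal used to pick a role
    hardware_ok: bool = True
    memory_ok: bool = True
    log: list = field(default_factory=list)


def bring_up(node: Node, control_planes: list) -> Optional[Role]:
    """Walk one node through the stateless boot steps, or return None."""
    # 1. Look for a control plane node to boot from.
    if not control_planes:
        node.log.append("no control plane found")
        return None
    cp = control_planes[0]
    node.log.append(f"found control plane {cp.name}")

    # 2. Pre-boot checks of hardware and memory.
    if not (node.hardware_ok and node.memory_ok):
        node.log.append("pre-boot checks failed")
        return None

    # 3. Pull the image from the control plane and boot into it.
    node.log.append(f"booting {cp.fetch_image()}")

    # 4. Network configuration discovery would happen here (omitted),
    #    then the node is identified as compute or storage.
    role = Role.STORAGE if node.has_storage_disks else Role.COMPUTE

    # 5. Enlist the node in the cloud as a new resource.
    cp.enlist(node.node_id, role)
    node.log.append(f"enlisted as {role.name.lower()}")
    return role
```

A usage example under the same assumptions: `bring_up(Node("n1", has_storage_disks=True), [ControlPlane("interconnect-1", "1.0")])` returns `Role.STORAGE` and records the node on the control plane, while a node that fails its memory check is never enlisted.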

Stateless Boot

Hardened firmware and OS

The standardized HyperCloud operating system is based on GNU/Linux with libraries, toolchains, and compute, storage, and management layers.

Hardened OS Builds

This OS layer is what runs on all the host machines. Nothing that is not strictly necessary on the host machine is included, as every line of code adds attack surface and maintenance burden. By taking a minimalistic approach to our host operating system, we make our host machines more secure and easier to support. In addition, our OS uses FIPS-compliant versions of SSL.