Building a “Storage Forest” on Kubernetes with Rook and Ceph

Not all nodes in a Kubernetes cluster have the same attached storage. Whether it’s ephemeral SSD instance store volumes on EC2 or Single Board Computers (SBCs) in your homelab, machines rarely have identical storage attached.
Making use of all the storage available to the cluster can be a challenge. Building networked, distributed storage with a homogeneous set of SSDs or HDDs is easy thanks to Rook, but what if we built one with all the common kinds of block storage? Could Ceph handle heterogeneous distributed storage?
To find out, we will build a “Storage Forest”. A storage forest is a combination of types of storage (similar to a heterogeneous SAN), all contributing to one big pool of data for workloads to use.
Kubernetes workloads can be configured to use fast NVMe drives, slower SSDs, or even HDDs, all inside the same underlying storage system (Ceph).
By building a storage forest, we can run workloads on different storage mediums available to the cluster to meet speed and durability needs for each workload. Applications, batch processing, OLAP, and OLTP workloads get their preferred storage via one storage solution.
How are we going to build it?
Building storage forests is easy – find some hardware, run Linux on it, install Kubernetes and Rook. In our case, this means:
- Assemble hardware or cloud resources (we’ve picked ODROID hardware, but anything that runs Kubernetes and Ceph will work!) with different kinds of storage (NVMe, SSD, HDD, etc)
- Set up the hardware and software/OS that we’ll be running (Ubuntu Server works great)
- Install base OS requirements (for Kubernetes, Rook, and any other software we want to run)
- Set up Kubernetes (k0s makes this nice and easy)
- Install Rook into the kubernetes cluster
- Set up the Custom Resource Definitions (CRDs) so Rook can configure our Ceph cluster
- Run workloads that use our storage (via Rook StorageClasses)
The Hardware
We’re standing on the shoulders of giants. There are lots of individual pieces of hardware and software that make this all work. Let’s go through the big pieces that make the storage forest work.
The technology behind storing bits and bytes has progressed in a lot of ways over the decades. We want to put as many as we can in our “forest”. Just like a robust forest would have multiple species of trees, our forest will have multiple kinds of data storage mediums, all working together as one ecosystem.
Storage has come a long way since punch cards and tape. While we can’t set up absolutely every kind of storage that’s ever existed (no BetaMax tapes!), we have most of the kinds of storage in widespread use today.
Let’s go through the pieces of storage tech we’ll use, slowest first.
HDD (sequential writes at ~170MB/s)
First we have the “spinning rust” Hard Disk Drive. Invented in 1957 (!), the humble hard disk drive has powered most of the computing revolution so far, starting with modest volumes all the way up to terabytes of data today.
Hard disks work by carefully magnetizing small pieces of a “platter surface” – but most importantly they are a massive improvement over storage mediums like tape because they can be used for random access. These days HDDs are everywhere – with firms like Backblaze heavily invested in managing large fleets of disks, going so far as to produce drive stats.
HDDs are the backbone of our forest. They’ll hold the largest chunk of data in the pool, backed by high-capacity commodity disks. While slow individually, HDDs are cheap to acquire relative to faster storage mediums, so they can be arranged in large architectures tuned for speed.
Solid State Drives (sequential writes at ~510MB/s)
The successor to the HDD is the Solid State Drive (SSD) – a much more recent technology which ushered in the era of fast drives. Arriving a full 34 years after the advent of the HDD, SSDs use flash memory that stores data in semiconductor cells, to which current can be applied to store or clear data.
SSDs have very different performance profiles from HDDs, but they too started small (around 20MB) and have since grown to capacities in the range of terabytes (at a much higher price point).
SSDs fit in our forest as the all-weather storage solution – most workloads can run quite well on SSDs, and are unlikely to push them to their limits. SSDs are perfect for workloads that benefit from fast random access, but deal with large volumes of data, and are somewhat more compute bound than I/O bound.
NVMe SSD (sequential writes at ~3,300MB/s)
Non-Volatile Memory Express (NVMe) SSDs represent the current state of the art (in consumer hardware) of fast hard drives, writing data almost a full order of magnitude faster than even SATA SSDs.
These SSDs are perfect for bursty workloads with high I/O requirements (whether random access or sequential), and high performance computing. While not quite as fast as RAM, NVMe is the closest to it.
NVMe fits in our forest as the high performance solution – workloads that need fast access to disks (where RAM wouldn’t do) should use these disks. Organizations like Let’s Encrypt run their next-gen infrastructure on NVMe.
Compute nodes
Of course, we’ll need to hook up some compute to run the cluster. Any Single Board Computer (SBC) will do, but I’ve chosen the ODROID, an ARM SBC family built by HardKernel sporting Exynos and Rockchip series processors. ODROIDs are readily available (compared to the Raspberry Pi, which is quite hard to source these days!) and offer good processing power and decent OS support.
While your setup may vary, the ODROID M1 can be assembled to hold both types of SSD (NVMe and SATA SSD):
Along with the M1, a somewhat toaster-like ODROID HC4 can hold the HDDs:
The Software
Ansible
Provisioning these machines often starts with flashing an Embedded MultiMediaCard (eMMC) chip, but once the machines are up and running, we have to terraform them somehow. Despite the suggestive phrasing, we won’t be using Terraform for this bit. All the SBCs come with support for SSH once flashed, so all we need is to manage the installation of crucial software.
This is where Ansible comes in. Ansible is great at setting up machines over SSH, and since all we need is minimal setup (k0s will be doing the heavy lifting for setting up Kubernetes), Ansible is perfect. We can assume we’ll start from a freshly “etched” eMMC with Ubuntu 20.04 from the ODROID installation guide.
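As a rough sketch, a playbook to prep each node might look something like this (the host group and package list are assumptions; adjust them to your inventory and distro):

```yaml
# prepare-nodes.yml -- a minimal sketch; the host group and package list
# are assumptions, adjust them to your inventory and distro.
- name: Prepare ODROID nodes for k0s and Rook/Ceph
  hosts: storage_forest
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.apt:
        name:
          - curl
          - lvm2            # ceph-volume uses LVM to prepare OSDs
          - smartmontools
        state: present
        update_cache: true

    - name: Ensure the rbd kernel module is loaded
      community.general.modprobe:
        name: rbd
        state: present
```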
Kubernetes (k0s)
Kubernetes isn’t the simplest workload orchestrator, but it’s certainly the most robust. Kubernetes orchestrates storage, networking, and workloads so we can use the storage forest we’re building. If you’re going to build a distributed storage network, a flexible upper layer of workload orchestration is key.
Setting up Kubernetes can be complex, but k0s makes it even easier than running kubeadm, so we’ll use it as our distribution here. k0s supports ARM architecture processors, which is perfect for our hardware of choice.
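For example, a minimal k0sctl configuration for a one-controller, one-worker cluster might look roughly like this (the addresses, SSH user, and key path are placeholders):

```yaml
# k0sctl.yaml -- minimal sketch; addresses, user, and key path are placeholders.
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: storage-forest
spec:
  hosts:
    - role: controller
      ssh:
        address: 192.168.1.10
        user: ubuntu
        keyPath: ~/.ssh/id_ed25519
    - role: worker
      ssh:
        address: 192.168.1.11
        user: ubuntu
        keyPath: ~/.ssh/id_ed25519
```

Running k0sctl apply --config k0sctl.yaml then bootstraps the cluster over SSH.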
Rook + Ceph
Since we’re using Kubernetes, the easiest way to deploy Ceph is with Rook.
Rook enables flexible, dynamic software-defined storage on top of Ceph, the industry-leading F/OSS storage fabric.
Rook makes it easy to deploy networked, distributed storage – commonly on similar hardware – often called “homogeneous” distributed storage.
In addition to being the easiest way to get a functioning Ceph cluster up, Rook provides us with the much-needed flexibility to manipulate our Ceph cluster – being able to specify CephBlockPools and StorageClasses with the right settings and annotations means we can target and utilize different kinds of storage much more easily.
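To give a feel for that, here’s a trimmed-down sketch of a CephCluster resource that pins a device class per device (the node names and device paths are made up, and Ceph will usually auto-detect hdd/ssd/nvme classes on its own):

```yaml
# Partial sketch of a CephCluster; node names and device paths are
# placeholders, and required fields like cephVersion are omitted for brevity.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: odroid-hc4-0
        devices:
          - name: /dev/sda
            config:
              deviceClass: hdd
      - name: odroid-m1-0
        devices:
          - name: /dev/nvme0n1
            config:
              deviceClass: nvme
```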
Putting it all together
As you might imagine, the storage forest works! Thanks to the Unix “everything is a file” maxim, with the right drivers the different storage mediums just melt into your Ceph cluster like anything else!
Check out the code (and run it for yourself) in the repository.
First we set up a BlockPool to configure our storage – making sure to set the deviceClass:
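(The spec below is a sketch; the pool name and replica count are illustrative, so check the repository for the exact values.)

```yaml
# A CephBlockPool backed only by the hdd device class.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  deviceClass: hdd
  replicated:
    size: 3
```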
You can also choose to build a hybrid storage pool, in which case you’ll want to specify primaryDeviceClass and secondaryDeviceClass:
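(Again a sketch; the names and sizes are illustrative.)

```yaml
# A hybrid CephBlockPool: primary replicas land on ssd, the rest on hdd.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hybrid-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
    hybridStorage:
      primaryDeviceClass: ssd
      secondaryDeviceClass: hdd
```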
After setting up the BlockPool, we can tack on a StorageClass built for the HDDs that looks like this:
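(A sketch; the clusterID and secret names assume a stock rook-ceph install, and the pool name matches the sketch above.)

```yaml
# A StorageClass pointing at the hdd-backed pool.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block-hdd
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: hdd-pool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
```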
We can run a fio workload that looks like this:
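(A sketch; the container image and fio parameters are illustrative.)

```yaml
# A PVC on the hdd-backed StorageClass and a Pod that runs a sequential
# write fio job against it. Image and fio flags are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-hdd-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block-hdd
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fio-hdd
spec:
  restartPolicy: Never
  containers:
    - name: fio
      image: docker.io/xridge/fio:latest   # any image with fio on the PATH will do
      command: ["fio"]
      args:
        - "--name=seq-write"
        - "--rw=write"
        - "--bs=4M"
        - "--size=2G"
        - "--directory=/data"
        - "--direct=1"
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: fio-hdd-pvc
```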
Separating workloads by StorageClass in the same cluster is just the beginning. We can also start to use some of Ceph’s advanced features to unlock even more functionality (one example is sketched after the list below):
- Flexible storage provisioning and orchestration with Rook
- RBD mirroring for moving data between storage classes
- Image Live Migration
- Object Storage (S3 & Swift compatibility)
- Shared Filesystem (similar to NFS)
- Network block device (a la nbd/iSCSI)
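As one example from that list, a CephObjectStore can spread its metadata and data pools across different device classes. A rough sketch (the names and sizes are illustrative):

```yaml
# An object store that keeps bucket metadata on nvme and object data on hdd.
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: forest-store
  namespace: rook-ceph
spec:
  metadataPool:
    failureDomain: host
    deviceClass: nvme
    replicated:
      size: 3
  dataPool:
    failureDomain: host
    deviceClass: hdd
    replicated:
      size: 3
  gateway:
    port: 80
    instances: 1
```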
Since Ceph makes running all the storage mediums next to each other so easy, the range of what we can build is multiplicative with the backing hardware we choose.
Going Further: Adding Tape to the cluster
Any storage mechanism that the Linux kernel can recognize is fair game for a storage forest. A great example is the venerable tape drive.
Though the color of the medium doesn’t match, tape is the “gold standard” for saving data for long periods of time. Archival-grade tape has excellent integrity and, perhaps more importantly, a price footprint that makes it attractive to people who need to store large volumes of data.
These days, services like AWS Glacier run on tape (you can even bring your own tapes!).
Tape drives can be a little hard to get a hold of without an enterprise budget, but I was lucky to find a somewhat dated HP StorageWorks Ultrium 1840:
Tape drives fit in our storage forest as the perfect place for old, long-lived archival data.
What kind of performance can we expect out of a tape drive? Well it’s a bit hard to find specs on such an old tape drive, but a few things pop up:
If the review is to be believed, we should expect 95MB/s writes. This is obviously quite slow – but the point here is to get the long-lasting durability of tape, and have it be accessible from inside the same distributed storage system.
If your Linux has a device driver for it (and you’ve got a block device under /dev), you can add it to the forest, and Ceph can write to it!
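Purely as a hypothetical sketch: if your tape setup does expose a block device, it could be listed in the CephCluster storage spec with its own device class (the node name, device path, and class name below are made up):

```yaml
# Hypothetical fragment of the CephCluster storage spec; the node name,
# device path, and "tape" device class are made up for illustration.
storage:
  nodes:
    - name: odroid-hc4-1
      devices:
        - name: /dev/tape-block0
          config:
            deviceClass: tape
```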