Building a “Storage Forest” on Kubernetes with Rook and Ceph

Not all nodes in a Kubernetes cluster have the same attached storage. Whether it’s ephemeral SSD instance store volumes on EC2 or a mixed bag of Single Board Computers (SBCs) in your homelab, machines rarely come with identical disks.

Making use of all the storage available to the cluster can be a challenge. Building networked, distributed storage with a homogeneous set of SSDs or HDDs is easy thanks to Rook, but what if we built one with all the common kinds of block storage? Could Ceph handle heterogeneous distributed storage?

To find out, we will build a “Storage Forest”. A storage forest is a combination of types of storage (similar to a heterogeneous SAN), all contributing to one big pool of data for workloads to use.

Kubernetes workloads can be configured to use fast NVMe drives, slower SSDs, or even HDDs, all inside the same underlying storage system (Ceph).

By building a storage forest, we can match each workload to the storage medium that best meets its speed and durability needs. Applications, batch processing, OLAP, and OLTP workloads all get their preferred storage from a single storage solution.

How are we going to build it?

Building storage forests is easy – find some hardware, run Linux on it, install Kubernetes and Rook. In our case, this means:

  1. Assemble hardware or cloud resources (we’ve picked ODROID hardware, but anything that runs Kubernetes and Ceph will work!) with different kinds of storage (NVMe, SSD, HDD, etc)
  2. Set up the hardware and software/OS that we’ll be running (Ubuntu Server works great)
  3. Install base OS requirements (for Kubernetes, Rook, and any other software we want to run)
  4. Set up Kubernetes (k0s makes this nice and easy)
  5. Install Rook into the Kubernetes cluster
  6. Set up the Custom Resource Definitions (CRDs) so Rook can configure our Ceph cluster
  7. Run workloads that use our storage (via Rook StorageClasses)

The Hardware

We’re standing on the shoulders of giants: lots of individual pieces of hardware and software make this all possible. Let’s go through the big pieces that make the storage forest work.

The technology behind storing bits and bytes has progressed in a lot of ways over the decades. We want to put as many as we can in our “forest”. Just like a robust forest would have multiple species of trees, our forest will have multiple kinds of data storage mediums, all working together as one ecosystem.

Storage has come a long way since punch cards and tape. While we can’t set up absolutely every kind of storage that’s ever existed (no BetaMax tapes!), we have most of the kinds of storage in widespread use today.

Let’s go through the pieces of storage tech we’ll use, slowest first.

HDD (sequential writes at ~170MB/s)

First we have the “spinning rust” Hard Disk Drive. Invented in 1957 (!), the humble hard disk drive has powered most of the computing revolution so far, growing from modest capacities all the way up to terabytes of data today.

Hard disks work by carefully magnetizing small pieces of a “platter surface” – but most importantly they are a massive improvement over storage mediums like tape because they can be used for random access. These days HDDs are everywhere – with firms like Backblaze heavily invested in managing large fleets of disks, going so far as to produce drive stats.

HDDs fit in our forest as its backbone. They’ll hold the largest chunk of data in the pool, backed by commodity high-capacity disks. While individually slow, HDDs can be arranged in large arrays for speed, since they are cheap to acquire relative to faster storage mediums.

Solid State Drives (sequential writes at ~510MB/s)

The successor to the HDD is the Solid State Drive (SSD) – a much more recent technology that ushered in the era of fast drives. Arriving a full 34 years after the advent of the HDD, SSDs store data in semiconductor flash cells, to which current can be applied to store or clear data.

SSDs have very different performance profiles from HDDs, but they followed a similar trajectory: early drives held tiny amounts of data (around 20MB), while today’s models store terabytes (at a much higher price point than HDDs).

SSDs fit in our forest as the all-weather storage solution – most workloads can run quite well on SSDs and are unlikely to push them to their limits. SSDs are perfect for workloads that benefit from fast random access, deal with large volumes of data, and are somewhat more compute-bound than I/O-bound.

NVMe SSD (sequential writes at ~3,300MB/s)

Non-Volatile Memory Express (NVMe) SSDs represent the current state of the art (in consumer hardware) of fast drives, writing data almost a full order of magnitude faster than even SATA SSDs.

These SSDs are perfect for bursty workloads with high I/O requirements (whether random access or sequential) and for high performance computing. While not quite as fast as RAM, NVMe is the closest storage medium to it.

NVMe fits in our forest as the high performance solution – workloads that need fast access to disks (where RAM wouldn’t do) should use these disks. Organizations like Let’s Encrypt run their next-gen infrastructure on NVMe.

Compute nodes

Of course, we’ll need to hook up some compute to run the cluster. Any Single Board Computer (SBC) will do, but I’ve chosen ODROIDs, ARM SBCs built by HardKernel sporting Exynos and Rockchip series processors. ODROIDs are readily available (compared to the Raspberry Pi, which is quite hard to source these days!) and offer good processing power and decent OS support.

While your setup may vary, the ODROID M1 can be assembled to hold both types of SSD (NVMe and SATA SSD):

View of the assembled ODROID M1 from the corner

View of the assembled ODROID M1 from above

Along with the M1, a somewhat toaster-like ODROID HC4 can hold the HDDs:

View of the assembled ODROID HC4 from the corner

View of the assembled ODROID HC4 from above

The Software

Ansible

Provisioning these machines often starts with flashing an Embedded MultiMediaCard (eMMC) chip, but once the machines are up and running, we have to terraform them somehow. Despite the suggestive phrasing, we won’t be using Terraform for this bit. All the SBCs come with support for SSH once flashed, so all we need is to manage the installation of crucial software.

This is where Ansible comes in. Ansible is great at setting up machines over SSH, and since all we need is minimal setup (k0s will be doing the heavy lifting for setting up Kubernetes) Ansible is perfect. We can assume we’ll start from a fresh “etched” eMMC with Ubuntu 20.04 from the ODROID installation guide.
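
As a rough sketch of what that minimal setup might look like, a short playbook only needs to put the basics in place before k0s and Rook take over. The inventory group name and package list below are illustrative assumptions, not the exact playbook from the repository:

---
# site.yml: a minimal sketch of the kind of setup Ansible handles for us.
# The "storage_forest" inventory group and the package list are assumptions.
- hosts: storage_forest
  become: true
  tasks:
    - name: Install base packages needed by k0s and Ceph OSDs
      ansible.builtin.apt:
        name:
          - curl
          - lvm2    # Rook's OSD provisioning relies on LVM tooling
          - chrony  # consistent clocks keep Ceph monitors happy
        state: present
        update_cache: true

    - name: Load the rbd kernel module so nodes can mount Ceph block devices
      community.general.modprobe:
        name: rbd
        state: present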

Kubernetes (k0s)

Kubernetes isn’t the simplest workload orchestrator, but it’s certainly the most robust. Kubernetes orchestrates storage, networking, and workloads so we can use the storage forest we’re building. If you’re going to build a distributed storage network, a flexible upper layer of workload orchestration is key.

Setting up Kubernetes can be complex, but k0s makes it even easier than running kubeadm, so we’ll use it as our distribution here. k0s supports ARM architecture processors, which is perfect for our hardware of choice.
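
k0s clusters can be bootstrapped over SSH with k0sctl and a single YAML config. A minimal sketch might look like the following (the addresses, user, and key path are placeholders for whatever your nodes actually use):

apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: storage-forest
spec:
  hosts:
    # Addresses, SSH user, and key path below are placeholders
    - role: controller
      ssh:
        address: 192.168.1.10
        user: root
        keyPath: ~/.ssh/id_rsa
    - role: worker
      ssh:
        address: 192.168.1.11
        user: root
        keyPath: ~/.ssh/id_rsa
    - role: worker
      ssh:
        address: 192.168.1.12
        user: root
        keyPath: ~/.ssh/id_rsa

Running k0sctl apply --config against a file like that bootstraps k0s on each host over SSH.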

Rook + Ceph

Since we’re using Kubernetes, the easiest way to deploy Ceph is with Rook.

Rook enables flexible, dynamic software-defined storage, on top of Ceph, the industry-leading F/OSS storage fabric.

Rook makes it easy to deploy networked, distributed storage – commonly on similar hardware – often called “homogeneous” distributed storage.

In addition to being the easiest way to get a functioning Ceph cluster up, Rook provides us with much-needed flexibility to manipulate our Ceph cluster – being able to specify CephBlockPools and StorageClasses with the right settings and annotations means we can target and utilize different kinds of storage much more easily.
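
One place that flexibility shows up is in the CephCluster spec itself. Ceph will usually auto-detect whether an OSD is backed by an HDD or SSD, but we can also pin a deviceClass per device. The node and device names below are hypothetical, and most of the cluster settings are trimmed down to a sketch:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v17
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      # Node and device names here are hypothetical examples
      - name: odroid-hc4-0
        devices:
          - name: sda
            config:
              deviceClass: hdd
      - name: odroid-m1-0
        devices:
          - name: sdb
            config:
              deviceClass: ssd
          - name: nvme0n1
            config:
              deviceClass: nvme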

Putting it all together

As you might imagine, the storage forest works! Thanks to the Unix “everything is a file” maxim, with the right drivers the different storage mediums just melt into your Ceph cluster like anything else!

Check out the code (and run it for yourself) in the repository.

First we set up a BlockPool to configure our storage – making sure to set the deviceClass:

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hdd
  namespace: rook-ceph
spec:
  deviceClass: hdd
  # NOTE: with enough hosts, use a failureDomain of host for greater durability
  failureDomain: osd
  # NOTE: since there are only 2 of each kind of drive, we can only replicate 2 assuming 1 OSD per drive
  replicated:
    size: 2
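
Pools for the other device classes follow the same shape; only the name and deviceClass change. For example, an NVMe-only pool might look like this (a sketch mirroring the HDD pool above):

---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: nvme
  namespace: rook-ceph
spec:
  deviceClass: nvme
  # NOTE: with enough hosts, use a failureDomain of host for greater durability
  failureDomain: osd
  # NOTE: since there are only 2 of each kind of drive, we can only replicate 2 assuming 1 OSD per drive
  replicated:
    size: 2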

You can also choose to build a hybrid storage pool, in which case you’ll want to specify primaryDeviceClass and secondaryDeviceClass:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: hybrid-nvme-ssd
  namespace: rook-ceph
spec:
  # NOTE: with enough hosts, use a failureDomain of host for greater durability
  failureDomain: osd
  # NOTE: since there are only 2 of each kind of drive, we can only replicate 2 assuming 1 OSD per drive
  replicated:
    size: 2
    hybridStorage:
      primaryDeviceClass: nvme
      secondaryDeviceClass: ssd

After setting up the BlockPool, we can tack on a StorageClass built for the HDDs that looks like this:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block-hdd
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph # namespace:cluster

  # If you want to use erasure coded pool with RBD, you need to create
  # two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  #dataPool: ec-data-pool
  pool: hdd

  # RBD image format. Defaults to "2".
  imageFormat: "2"

  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
  imageFeatures: layering

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster

  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the osds.
  csi.storage.k8s.io/fstype: ext4
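
StorageClasses for the other pools differ only in their name and the pool they point at. For instance, an NVMe-backed class targeting the nvme pool sketched earlier could look like this (same CSI secrets and settings, abbreviated):

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block-nvme
provisioner: rook-ceph.rbd.csi.ceph.com
allowVolumeExpansion: true
reclaimPolicy: Delete
parameters:
  clusterID: rook-ceph # namespace:cluster
  # Point at the NVMe-backed CephBlockPool instead of the HDD one
  pool: nvme
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/fstype: ext4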

We can run a fio workload that looks like this:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-hdd
  namespace: default
spec:
  storageClassName: rook-block-hdd
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 64Gi
# see: https://github.com/longhorn/dbench
---
apiVersion: batch/v1
kind: Job
metadata:
  name: fio-hdd
spec:
  backoffLimit: 0
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dbench
        image: fbuchmeier/dbench:latest
        imagePullPolicy: Always
        # privileged mode is needed to invalidate the fs cache
        securityContext:
          privileged: true
        env:
          - name: FIO_SIZE
            value: 4G
          - name: DBENCH_MOUNTPOINT
            value: /data
          - name: FIO_DIRECT
            value: "1"
          # - name: DBENCH_QUICK
          #   value: "yes"
          # - name: FIO_OFFSET_INCREMENT
          #   value: 256M
        volumeMounts:
          - name: disk
            mountPath: /data

      volumes:
        - name: disk
          persistentVolumeClaim:
            claimName: fio-hdd

Separating workloads by StorageClass in the same cluster is just the beginning. We can also reach for Ceph’s more advanced features to go even further.

Since Ceph makes running all these storage mediums next to each other so easy, the range of what we can build multiplies with every kind of backing hardware we add.
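
For example, a latency-sensitive database can be pinned to the NVMe-backed class while bulk jobs stay on the HDD class. The sketch below assumes the rook-block-nvme StorageClass from earlier and a pre-existing postgres-credentials Secret:

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: default
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials  # assumed to already exist
                  key: password
            - name: PGDATA
              # keep Postgres data in a subdirectory of the mounted volume
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        # Pin this workload to the NVMe-backed StorageClass
        storageClassName: rook-block-nvme
        resources:
          requests:
            storage: 32Gi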

Going Further: Adding Tape to the cluster

Any storage mechanism that can be recognized by the Linux kernel is fair game for a storage forest. A great example is the venerable tape drive.

Though the color of the medium doesn’t match, tape is the “gold standard” for saving data for long periods of time. Archival-grade tape has excellent integrity and, perhaps more importantly, a price footprint that makes it attractive to people who need to store large volumes of data.

These days, services like AWS Glacier run on tape (you can even bring your own tapes!).

Tape drives can be a little hard to get a hold of without an enterprise budget, but I was lucky to find a somewhat dated HP StorageWorks Ultrium 1840:

Ultrium 1840

Tape drives fit in our storage forest as the perfect place for old, long-lived archival data.

What kind of performance can we expect out of a tape drive? It’s a bit hard to find specs for such an old unit, but a few reviews turn up.

If the reviews are to be believed, we should expect around 95MB/s writes. This is obviously quite slow – but the point here is to get the long-lasting durability of tape, and have it be accessible from inside the same distributed storage system.

If your Linux has a device driver for it (and you’ve got a block device under /dev), you can add it to the forest, and Ceph can write to it!

Victor Adossi October 03, 2022