Gardening and the art of data storage
Data storage is a lot like gardening. Perhaps you spend time outdoors doing yardwork or tending a private garden. Or maybe your only exposure to gardening is through Farmville. Either way, walk with me through this metaphor, and see data storage through a new philosophical lens.
Designing with intention
A thoughtful, well-attended garden is a source of physical and mental sustenance. In a garden, plants are established with intention, whether to grow food or raw materials for textiles, to be pleasing to look at or to offer shade, or to attract and provide for wildlife.
Greenery has a naturally calming effect on our minds. The miraculous organization of shapes and colors of flowering plants triggers happiness. Adding to the beauty is the fluttering life of bees, hummingbirds, and butterflies.
Above ground, plants scrub the air, extracting carbon dioxide and releasing oxygen. Below ground, roots hold the earth in place and acquire nutrients, while earthworms aerate the soil, and bugs take care of general maintenance.
A garden is alive and, once it is established, it just works.
Similarly, well-managed data storage physically sustains data in its intended form so that applications can use it as needed. Good data storage provides apps with mental sustenance, so to speak.
We design hardware systems with enough capacity for the expected volume of data. We design networks to handle the expected throughput requirements. Data management software, like Ceph and Rook, takes care of general maintenance to ensure that the data stays healthy and available. The applications that use data are put into containers that can replicate easily. Kubernetes acts as the groundskeeper, coordinating the various elements of our garden.
Sometimes you need a particular tool or experience to know what is holding the system back. When you need help, you can rely on a specialist like Koor, someone who understands how to get the ecosystem to thrive.
Three kinds of gardening and data
At my home, I do three kinds of gardening. These are similar to the three types of data storage that can be managed.
Outdoors, I have a vegetable garden in a raised bed. That’s a lot like object storage. I have designated areas (akin to buckets) for growing different types of crops (the objects). Beans here, carrots there, potatoes in little mounds, pumpkins on the side where they have room to spread as far as they like. Farmers do something similar, though at a much larger scale than would fit in my backyard.
Also outdoors, I have my yard, which is a lot like block storage. One can plant just about anything in a yard, and there are no limits to how it can be organized. I happen to live in California, which has long dry seasons, so my yard is full of drought-tolerant plants, most of which display colorful flowers throughout the summer. Some yards are mostly grass that’s easy to mow, a few trees, and some shrubbery. Other yards might be full of moss or other ground cover. Yards in the desert feature stones, sand, and succulents.
For the third kind, consider the plants we keep indoors. Houseplants go into their own pots, each having the right kind of soil and good drainage. If the plant grows too much, you will need to move it to a bigger pot or split it across multiple pots. This is a lot like file storage, where specific file types are managed by a file system.
The main point is that good data storage handles block, object, and file storage.
Weeds and other anomalies
For the most part, anything that grows in a garden that was not planted by a human is a weed. Nature has a way of taking over, and if we fail to intervene, a garden will “go wild.” If you are lucky, a conscientious squirrel will plant an acorn in just the right place for a shady oak to grow. However, in most cases, the natural sowing of seeds tends to result in unwanted vegetation.
Weeds are resilient, which is why you have to work to get rid of them. While you may want to allow nature to reclaim an area of land, letting weeds take over is not gardening. It’s the intention that matters.
When data deviates from its original form, a good data storage system will detect the change and restore the data. For instance, Ceph keeps multiple copies of each piece of data and regularly scrubs them, looking for and repairing inconsistencies by comparing the copies to each other. When one copy turns into a weed, Ceph turns it back.
Ceph also does periodic deep scrubbing, which is like pulling weeds out by their roots.
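To make the weeding idea concrete, here is a toy sketch of scrubbing by comparison. It is not Ceph's actual implementation; the `scrub` function, the SHA-256 checksums, and the majority-vote rule are all assumptions chosen for illustration. It simply treats the copy that most replicas agree on as the intended plant and replaces any copy that has turned into a weed.

```python
import hashlib
from collections import Counter


def checksum(data: bytes) -> str:
    """Fingerprint a replica so copies can be compared cheaply."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Compare replicas and repair any that disagree with the majority.

    Hypothetical helper for illustration only. Returns the healed replicas
    and the indexes of the copies that were repaired.
    """
    digests = [checksum(r) for r in replicas]
    majority_digest, votes = Counter(digests).most_common(1)[0]

    # Without a strict majority we cannot tell the crop from the weed.
    if votes <= len(replicas) // 2:
        raise ValueError("no majority among replicas; manual repair needed")

    good_copy = replicas[digests.index(majority_digest)]
    repaired = [i for i, d in enumerate(digests) if d != majority_digest]
    healed = [good_copy if d != majority_digest else r
              for r, d in zip(replicas, digests)]
    return healed, repaired


# Three copies of the same data; the third has gone wild.
copies = [b"carrots", b"carrots", b"dandelion"]
healed, repaired = scrub(copies)
print(healed, repaired)
```

With only two copies that disagree, this sketch refuses to guess, which is one reason keeping three or more copies is a common choice.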
Outsourced vs DIY data storage
You may have recognized the title of this post as a nod to the book Zen and the Art of Motorcycle Maintenance, by Robert M. Pirsig. One assertion of the book that has stuck with me is that there are two kinds of motorcycle riders. One enjoys riding but does not want to be bothered to understand the inner workings of motorcycles. They prioritize enjoyment of the activity, and if a problem occurs, they will rely on a mechanic to make things right.
The other kind becomes one with the motorcycle, learning the functions of the parts and tuning their motorcycle for top performance. When things inevitably break down, they have the skills to get themselves going again.
In the context of data storage, we have different options for provisioning and managing data. One is to rely on hosted data solutions that charge “by the pound” (e.g., $X.99/TB/month) to handle everything on your behalf. You pay for some number of 9s of assurance, and you trust them to handle the details, whatever those are.
On the other hand, you can use do-it-yourself data storage and invoke self-reliance to ensure that you have the same level of confidence that your data storage is resilient, scalable, accessible, secure, and so on. In the old days (like 20 years ago), that meant taking care of maintenance and redundancy yourself, often breaking a sweat to avoid mistakes. These days, you can set up a Ceph cluster, make it available to Kubernetes via Rook, and everything just works.
In fact, behind the scenes, those data providers are probably using Rook Ceph or a similar solution to do the heavy lifting.
To be clear, whether you use a data service or manage your own storage is not a question of moral character. Each has its place, and outsourcing can be a great strategy. You should even consider a blended strategy that matches the various needs of your systems. When you need that extra control and confidence that you can manage things yourself, the do-it-yourself approach is within reach.
Adopting open source solutions like Rook and Ceph is a great choice. The software is available for inspection and open to contributions. You can rely on the community to provide guidance and expertise. Plus, you can fork the projects and make adjustments to better suit your needs. You can even donate your improvements back to the community. That level of access can make a big difference in your success.
The gardener and the data operator
A great data management solution works like an expert gardener. Here’s a quick comparison of the person who cares for a garden (a gardener) and the software that cares for data (a data operator).
| A gardener… | A data operator… |
|---|---|
| prepares beds for crops or flowers | accepts new data volumes for expansion |
| weeds the garden | scrubs data to find and repair inconsistencies |
| gathers seeds for the next crop | replicates data for redundancy and resiliency |
| harvests and stores crops | archives and backs up data for future use |
| removes dead plants and debris | offlines failing storage media to prevent critical data loss |
| prunes branches to stimulate future growth | balances data across all available nodes |
| shares the bounty with the community | supplies the digital community with life-giving data |
Planning your data storage
Finally, I leave you with questions to consider when planning your data storage solution.
Begin with clear intentions
- Purpose : What is the data storage for?
- Access : Who will access the data over what networks?
- Speed : What level of slowness will people tolerate? What will be fast enough?
- Security : Does your data need to be encrypted? Does it need discrete access control?
- Growth : How quickly will the data grow? Do you have enough storage capacity now and a way to add more later?
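The growth question lends itself to a back-of-the-envelope calculation. The sketch below assumes growth is roughly linear, which real workloads often are not, and the function name and numbers are made up for illustration.

```python
import math


def days_until_full(capacity_tb: float, used_tb: float,
                    growth_tb_per_day: float) -> float:
    """Estimate how many days remain before storage fills up.

    Hypothetical helper: assumes a constant daily growth rate.
    """
    if growth_tb_per_day <= 0:
        return math.inf  # not growing, so it never fills at this rate
    remaining_tb = max(capacity_tb - used_tb, 0.0)
    return remaining_tb / growth_tb_per_day


# Example: a 100 TB cluster holding 60 TB and growing by 0.5 TB per day.
print(days_until_full(capacity_tb=100, used_tb=60, growth_tb_per_day=0.5))
```

If the answer is shorter than the time it takes you to order, install, and bring new disks online, it is time to start preparing the next bed.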
Consider ongoing maintenance
- Resiliency : How will data integrity be ensured? Does your system self-heal corruption as it happens?
- Redundancy : Do you have multiple copies of your data? Are they kept in different locations? Do changes stay in sync automatically?
- Upgrades : How will you keep the data management software up to date?
Prepare for the worst
- Famine : What happens if multiple storage media die at once? What happens if a data center goes down?
- Flood : What if your storage suddenly fills up? How can you be warned ahead of that happening? How can you build a levee to prevent exceeding the limits?
- Wildfires : Do you have a way to reclaim space when data is no longer needed? Clear out the dead brush, so to speak.
- Locusts : Can hungry applications take over all of the bandwidth and starve other applications? Do you have ways to measure and throttle usage?
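For the locusts, one common way to throttle usage is a token bucket. The sketch below is a generic illustration of that technique, not the mechanism that Ceph, Rook, or Kubernetes uses; the `TokenBucket` class, its parameters, and the megabyte units are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow bursts up to `burst` units, refilled at `rate_per_sec` units."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = burst         # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits the budget."""
        now = self.clock()
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical budget for one hungry app: 10 MB/s with bursts up to 20 MB.
bucket = TokenBucket(rate_per_sec=10, burst=20)
print(bucket.allow(cost=15))  # fits within the initial burst
print(bucket.allow(cost=15))  # too soon; the bucket has not refilled yet
```

Giving each application its own bucket keeps one swarm from stripping the field bare while the others go hungry.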
Enjoy the results
The best part of gardening is sitting back when the work is done and watching it grow. You can see how your efforts are paying off and get new ideas for things to try next. The same is true of effective data storage. Once you set it up, it just works, giving you time to focus on other things.