A retrospective on the surprising costs of moving data to the Cloud
Cloud Chasm Crossed
Did you leap into the Cloud? Or did you hold back, reasonably cautious about chasing the Next Big Thing? Certainly, by now some portion of your systems are running in the Cloud, and that’s without counting all of the Cloud-native services to which you subscribed. We must be well into the Late Majority portion of the adoption curve.
The Laggards are still doubting, waiting for it all to collapse.
Now that you are in the Cloud, so is your data. You moved in some digital assets on day one. Perhaps you completed a systems migration, or uploaded images for your website and log files for safekeeping. The data has been growing and piling up ever since.
The great benefit of running in the Cloud is that someone else is handling the physical problems. Buying and racking hardware, cabling, supplying power, and keeping things cool. Out of sight, out of mind. All you have to do now is focus on the logical bits. Do you have enough storage? Is it accessible to users? Is it safe from hackers? Are you respecting regulations? Do you have a bulletproof backup strategy? Plenty left to worry about.
While you were distracted by your worries, you might not have noticed that your mental model for the cost of keeping your data in the Cloud was optimistic. Or maybe the Cloud provider you chose was great for getting started, but its solutions haven’t aged so well with extended use. Is it too late to go back? And go back to what exactly?
Home Sweet Data Center
Let’s take a walk down memory lane and remember life at the turn of the 21st century.
At the dawn of the dot-com era in the late 1990s, we got good at setting up cozy (i.e., crowded) cages in data centers. The hum of cooling fans, the whir of spinning drives, the hypnotic blinky lights. The best facilities sported security to rival Fort Knox, extreme air-conditioning, fire suppression systems that could suck out all of the oxygen in a matter of minutes, and access to the fattest broadband networks the world had to offer.
At least in Silicon Valley, the cost of housing rivaled the cost of cage space. The idea of setting up a cot among the racks might have crossed a few minds. Why would we abandon this luxury?
A modest web company would need to spend millions of dollars on servers and equipment. It’s no wonder that most of the first generation of dot-coms went under. We burned cash like there was no tomorrow. And sure enough, the next day when the party ended, high-end data centers collapsed, too.
Out of the ashes, the second generation of web companies (ASPs) was more conservative with cash. Having learned our lessons, we still wanted to control everything, but we wanted to find the cheapest way to do it. Rather than renting from a fancy co-lo, companies would often rack ’em where they worked, in a repurposed closet or under people’s desks. Once a company got traction, the systems that mattered were moved into “on-premises” server rooms, with everything you’d find at a data center.
At the same time, the data centers recovered, perhaps under new management. With all of the trappings from before and some new pricing models, data centers were, and are, still a viable option for hosting your high-end systems.
In parallel with the burgeoning second generation of dot-coms, Jeff Bezos seeded the first cloud operation in 2002 with his famous API-only mandate. Amazon launched AWS with EC2 in 2006. Google launched its Cloud Platform two years later with App Engine. And then came Project Red Dog (a.k.a. Azure), which launched in 2010.
Simultaneously, successful enterprise software companies were experiencing growing pains. We had designed monolithic Java-based, RDB-backed applications to handle all of the -ilities: flexibility, scalability, adaptability, upgrade-ability, and so on. For the largest customers, we had to push the limits in scaling the hardware to match the demand.
As we rethought the architecture for new applications, we looked for ways to avoid recreating the monolith. We kicked around the ideas of horizontal scaling across “commodity” hardware, sharded databases with top-level controllers to route queries, and microservices.
In the cloud, software runs as a monolith, as microservices, or “serverless.” Docker and Kubernetes helped us reorganize code into deployable units. Bada bing bada boom, the world had changed.
Gone Full Cloud
Here’s the place in the story where we travel by map to today. Now that many of us have gone full cloud, we have a better sense of what that means. For hobbyists and small software shops, working in the cloud is essential. In my spare time, I tinker on a website for the good of humanity (an infinite WIP), and everything I can imagine needing is available to me, more than I’ll ever use. Let’s consider the small end of the spectrum solved.
As with monolithic applications, interesting problems happen at scale. If you are in a business that depends heavily on large volumes of data (a digital media company, a research lab, a medical facility for CT scans, an AI innovator), you need a good place to store data, and a system to ensure it won’t get corrupted. You also have to worry about availability. Depending on how you manage it, the costs to store and use your data can be staggering.
Let’s build a crude model to understand the cost of hosting data on AWS. For this model, we will look at S3 and use AWS’ pricing calculator (as of May 17, 2023). We can keep it simple and still get a sense of the order of magnitude of the cost. (Your situation will vary.)
Say we run a large space telescope. The James Webb telescope produces pictures that are roughly 32 MB uncompressed and has an on-satellite capacity of 66 GB. On average, 58 GB of images get downloaded to Earth per day. In a year, that amounts to roughly 21 TB of image data. NASA could store that on AWS in an S3 bucket in Ohio (US East) for about $500 per month.
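To make the arithmetic easy to rerun with your own numbers, here is a minimal Python sketch of the storage estimate. The 58 GB per day comes from the paragraph above; the flat rate of $0.023 per GB-month is my assumption for S3 Standard, not a figure from the pricing calculator, so check current pricing before trusting the result.

```python
# Figure from the text above.
DAILY_DOWNLINK_GB = 58

# Assumption: a flat S3 Standard rate in USD per GB-month.
# Real pricing is tiered and changes over time.
ASSUMED_PRICE_PER_GB_MONTH = 0.023


def yearly_volume_tb(daily_gb: float, days: int = 365) -> float:
    """Data accumulated over a year, in decimal terabytes."""
    return daily_gb * days / 1000


def monthly_storage_cost(stored_tb: float, price_per_gb_month: float) -> float:
    """Monthly bill in USD for keeping stored_tb terabytes at a flat rate."""
    return stored_tb * 1000 * price_per_gb_month


yearly_tb = yearly_volume_tb(DAILY_DOWNLINK_GB)
cost = monthly_storage_cost(yearly_tb, ASSUMED_PRICE_PER_GB_MONTH)

print(f"{yearly_tb:.2f} TB after one year")
print(f"${cost:,.2f} per month to store it")
```

Under that assumed rate, a year of images lands a little under $500 per month, in line with the estimate above.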
As the years go by, the Webb space paparazzi keep snapping away. The telescope team on the ground is going to look at each picture at least once. Also, to be realistic, we’d want to reduce and compress the images before putting them on the nasa.gov website. So let’s model one write of the original file downloaded from space, two reads of the file (by the human and the compression program), and one write of a web-ready image.
~1,800 files / day × 30 days = 54,000 files per month
(32 MB + 3 MB) of writes per file × 54,000 files per month = 1.89 TB of writes per month
64 MB of reads per file × 54,000 files per month = ~3.46 TB of reads per month
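The same traffic model can be written as a small Python function, which makes it easier to try other image sizes or read counts. This is only a sketch of the arithmetic above; the function and field names are mine, and it uses decimal units (1 TB = 1,000,000 MB).

```python
from dataclasses import dataclass


@dataclass
class MonthlyTraffic:
    files: int
    write_tb: float
    read_tb: float


def monthly_traffic(
    files_per_day: int = 1_800,   # roughly 58 GB / 32 MB, from the text
    days: int = 30,
    raw_mb: float = 32,           # original image downloaded from space
    web_mb: float = 3,            # web-ready image
    reads_per_file: int = 2,      # the human and the compression program
) -> MonthlyTraffic:
    """Monthly file count plus write and read volume in decimal terabytes."""
    files = files_per_day * days
    write_tb = files * (raw_mb + web_mb) / 1_000_000
    read_tb = files * reads_per_file * raw_mb / 1_000_000
    return MonthlyTraffic(files, write_tb, read_tb)


traffic = monthly_traffic()
print(f"{traffic.files:,} files per month")
print(f"{traffic.write_tb:.2f} TB written, {traffic.read_tb:.2f} TB read")
```

Changing `reads_per_file` is the quickest way to see how sensitive the read volume is to how often each picture actually gets opened.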
On AWS, inbound data is free. They want to encourage storage use, so that makes sense. For the outbound data, let’s assume it’s going to the lower-cost data center in Virginia, which is half the price of the regular egress. That adds up to $655 per month.
So far, our needs are met for $1,152 per month or $13,824 per year. What about when thousands of people start looking through pictures? What about the other telescopes? What about the backups? Costs will scale in proportion to the popularity of the images.
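To get a feel for how popularity changes the bill, here is a rough Python sketch. It assumes every public view downloads one 3 MB web-ready image and that Internet egress costs a flat $0.09 per GB. Both are my assumptions for illustration; the sketch ignores pricing tiers, free allowances, request charges, and any CDN in front of the bucket.

```python
BASELINE_MONTHLY_USD = 1_152        # monthly total from the text
WEB_IMAGE_MB = 3                    # web-ready image size used earlier

# Assumption: flat Internet egress rate in USD per GB.
ASSUMED_EGRESS_USD_PER_GB = 0.09


def monthly_cost_with_views(views_per_month: int) -> float:
    """Baseline cost plus egress for serving one web image per view."""
    egress_gb = views_per_month * WEB_IMAGE_MB / 1000
    return BASELINE_MONTHLY_USD + egress_gb * ASSUMED_EGRESS_USD_PER_GB


for views in (10_000, 1_000_000, 100_000_000):
    total = monthly_cost_with_views(views)
    print(f"{views:>11,} views -> ${total:,.0f} per month")
```

Under these assumptions the baseline dominates until views reach the millions, after which egress takes over the bill.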
When the cost is high enough, it makes us wonder. What would it take to keep this data on our own hardware in our own (or rented) data center? How much does a disk drive cost? Would it make sense to return to the old model of self-hosting?
DIY Cost Comparison
For a thumbnail comparison, let’s outline a possible do-it-yourself solution and put some hardware in a shopping cart. Assume we use Ceph to manage our data storage. To have sufficient redundancy, we will need about 3 times the storage required for a single copy of the data. That means providing about 73 TB of capacity.
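As a sanity check on the capacity, here is a small Python sketch. Three copies of one year of data (21 TB) comes to 63 TB of raw disk. The `max_fill` parameter is a hypothetical headroom target of my own, set to 85% here purely for illustration, to show how leaving free space pushes the requirement into the low 70s.

```python
SINGLE_COPY_TB = 21     # one year of images, from the text
REPLICAS = 3            # three full copies for redundancy

# Assumption: never plan to fill the drives beyond this fraction.
ASSUMED_MAX_FILL = 0.85


def raw_capacity_tb(data_tb: float, replicas: int = 3, max_fill: float = 1.0) -> float:
    """Raw disk needed to hold `replicas` copies while staying under max_fill."""
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    return data_tb * replicas / max_fill


bare = raw_capacity_tb(SINGLE_COPY_TB, REPLICAS)
padded = raw_capacity_tb(SINGLE_COPY_TB, REPLICAS, ASSUMED_MAX_FILL)

print(f"{bare:.0f} TB with no headroom")
print(f"{padded:.1f} TB at an assumed {ASSUMED_MAX_FILL:.0%} fill target")
```

The web-ready copies and a second year of images would add to this, so treat the output as a floor rather than a purchase order.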
And here is the point where the profusion of options takes us down many paths. How many hard drives, and of what size, will optimize cost while enabling proper recovery from one or two hard-drive failures at a time? What brand of hard drive – high or low end? How about SSDs or NVMe drives? How should we connect the drives? What network switches and bandwidth are needed? How do we get enough power and keep everything cool?
Cost is only one of the factors. Making sure the solution works and having the expertise to keep it running well are also critical.
For this discussion, I found an easy way out. A quick search shows companies that provide complete solutions for exactly this purpose. In a single cabinet, you get 72 TB of capacity under Ceph management for “under $12,000.” Imagine going with that solution. You’d still have to pay for power, climate control, connections to the Internet, and so on. Traditional data center stuff. Depending on utilization, maybe you would break even compared to Cloud costs by sometime in the second year.
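The break-even guess can be sketched in a few lines of Python. The $12,000 cabinet and the $1,152 monthly cloud bill come from the text; the $500 per month for power, cooling, and connectivity is a placeholder I made up, and the sketch leaves out staff time entirely.

```python
import math
from typing import Optional

CLOUD_MONTHLY_USD = 1_152       # monthly cloud cost from the text
CABINET_USD = 12_000            # upper bound quoted for the cabinet

# Assumption: hypothetical running cost of hosting the cabinet yourself.
ASSUMED_DIY_MONTHLY_USD = 500


def break_even_month(upfront: float, diy_monthly: float, cloud_monthly: float) -> Optional[int]:
    """First month in which cumulative DIY cost is no higher than cloud cost.

    Returns None when DIY running costs are not below the cloud bill,
    because the upfront purchase is then never recovered.
    """
    saving = cloud_monthly - diy_monthly
    if saving <= 0:
        return None
    return math.ceil(upfront / saving)


month = break_even_month(CABINET_USD, ASSUMED_DIY_MONTHLY_USD, CLOUD_MONTHLY_USD)
print(f"Break-even in month {month}")
```

With the placeholder running cost, the cabinet pays for itself in month 19, which is where "sometime in the second year" comes from. Push the running cost toward the cloud bill and the break-even point slides out quickly, or never arrives.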
Enter the Data Control Plane
So maybe it’s not so bad to stay in the Cloud. Ultimately you get to choose how much you rely on public vs. private clouds. There is no need for a full retreat to the comfort of your cage. Applications will run anywhere, and you can access your data wherever it happens to be.
For example, if you run apps in Kubernetes, you can use Rook to attach to Ceph data clusters running in any public cloud, in your private cloud, or on bare metal under your desk. Mix and match to get the right level of accessibility and durability, and optimize your costs as your situation evolves.
The cool kids these days are talking about control planes. That’s a fancy way of imagining a 2D surface that reaches, in a vendor-neutral way, anywhere in the cloud that you need to access. Crossplane is leading the way for general cloud management. Rook is here to help manage the data control plane.
I like to think of it as giving us the freedom to roam the Great (control) Plains of Data, but that’s the subject of an essay that will emerge from my fingertips another day.