Optimizing a multimillion-dollar cloud bill
In 2019, A Cloud Guru acquired Linux Academy — along with a $2.7 million USD annual AWS bill. Cost optimization became a hot topic very quickly.
Jun 08, 2023 • 7 Minute Read
In December 2019, A Cloud Guru acquired our largest competitor, Linux Academy -- along with a $2.7 million USD annual AWS bill.
Prior to the acquisition, cloud spend was a proportionally small expense for ACG. After all, we’re the people who identified as the world’s first serverless startup and liked to show off pictures of our empty production EC2 console. At one point, our monthly office catering bill was actually higher than what we spent on AWS. So we hadn’t put a lot of effort into a systematic cloud cost control process.
But when we combined two companies’ worth of cloud spend into one bill, AWS cost optimization became a hot topic around here very quickly!
Approaching a black-box cloud bill
First, let’s acknowledge that rising cloud spend, in and of itself, isn’t necessarily a bad thing. If your gross margin is staying the same as the cloud bill increases, that might be perfectly fine - literally the cost of doing business! But if margins are decreasing as you grow, or you want to increase the margin, then it’s time to optimize spend.
So before we could do anything about reducing our bill, we knew we needed to institute some careful governance, planning, and budgeting. The gaps were clear: we had a lack of process for up-front cost analysis, undefined cost metrics and KPIs, and not a lot of visibility into our cloud spend beyond the bill itself.
The ultimate impact of these gaps? It was difficult for us to identify unit costs & drivers of spending. Without cloud financial management, our $3 million annual combined cloud spend was a black box.
Defining and measuring cost metrics and KPIs
Not only were we not sure what we were spending the most money on, but we also didn’t have a framework for determining whether we were spending too much money at all.
But we wanted to make cloud optimizations proactively, so we engaged in a cost control discovery period - seeking real-time information that would give us insight into how much we were spending, and where.
Here are some of the metrics we felt were especially important.
Cloud Cost Metric | Key Question | Insight |
Budgeted spend vs actual | How accurately are we forecasting our spend? | Drives investigation into spending and decisions to cut waste if identified |
Percentage of total cloud spend allocated by function/business unit/service area | How are we allocating spending across different departments & services? | Enables us to identify areas driving variations in budget to actual |
Percent contribution to gross margin | How much overhead does our cloud spend consume relative to the rest of the business? | Insight into whether cost growth / decline is tracking in line with revenue |
Cloud spend per active user - broken down by product line such as Hands-On Labs | How much does it cost to serve one user on our platform? | Leading indicator for gross margin. Helps us understand how user growth drives COGS |
Cloud spend per employee (i.e., developer or training architect) | How efficient is our spending on operational activities? | Drives optimization of operational costs |
We were able to answer many of these questions from available financial data. Once we came up with a set of metrics that were feasible, we established some KPIs to help us guide those metrics in the right direction. For example, we determined that we wanted to see about 80% of our EC2 spend coming from reserved instances or savings plans.
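To make a few of these metrics concrete, here is a minimal Python sketch of how they might be computed from monthly figures. It is an illustration only, not the tooling described in this post: the `MonthlyCloudFigures` fields, the function names, and every dollar amount are hypothetical, and only the 80% coverage target comes from the text above.

```python
from dataclasses import dataclass

# KPI from the text: ~80% of EC2 spend from reserved instances or savings plans.
COVERAGE_TARGET_PCT = 80.0


@dataclass
class MonthlyCloudFigures:
    """Hypothetical monthly inputs; all values used below are made up."""
    budgeted_spend: float       # USD planned for the month
    actual_spend: float         # USD that appeared on the bill
    active_users: int           # users served during the month
    ec2_spend_total: float      # all EC2 spend, USD
    ec2_spend_committed: float  # EC2 spend covered by reservations or savings plans


def budget_variance_pct(m: MonthlyCloudFigures) -> float:
    """Budgeted vs actual: positive means over budget, negative means under."""
    return (m.actual_spend - m.budgeted_spend) / m.budgeted_spend * 100


def spend_per_active_user(m: MonthlyCloudFigures) -> float:
    """Cloud spend per active user, in USD."""
    return m.actual_spend / m.active_users


def commitment_coverage_pct(m: MonthlyCloudFigures) -> float:
    """Share of EC2 spend covered by reservations or savings plans."""
    return m.ec2_spend_committed / m.ec2_spend_total * 100


def meets_coverage_target(m: MonthlyCloudFigures) -> bool:
    return commitment_coverage_pct(m) >= COVERAGE_TARGET_PCT


# Illustrative numbers only.
month = MonthlyCloudFigures(
    budgeted_spend=250_000,
    actual_spend=262_500,
    active_users=105_000,
    ec2_spend_total=160_000,
    ec2_spend_committed=40_000,
)

print(f"Budget variance:       {budget_variance_pct(month):+.1f}%")
print(f"Spend per active user: ${spend_per_active_user(month):.2f}")
print(f"Commitment coverage:   {commitment_coverage_pct(month):.1f}%")
print(f"Meets coverage target: {meets_coverage_target(month)}")
```

Keeping each metric as a small, separate function makes it easy to point a dashboard or an alert at a single number and its target.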
Establishing deep visibility into cloud spending
A Cloud Guru and Linux Academy have some fairly unique wrinkles in our cloud costs, because we allow our users to spin up their own cloud resources as part of our Hands-On Labs and Cloud Playground features. Initially, the bill for these services rolled up through the same AWS Organization as our SaaS platform, so it was difficult for us to parse where the opportunities for optimization lay.
As a first step, we reshuffled our AWS accounts so that our user-defined resources fall under a different top-level organization than the one used to run the platform. We’re also instituting an enhanced tagging strategy, along with more specific organizational units (OUs) in our AWS organizations, to further match costs to their related services and departments.
Next, we implemented real-time cost monitoring to get a better view on what we were spending and where.
Standard cost monitoring tools aren’t a great fit for us because of all the ephemeral AWS accounts our learners spin up through Cloud Playground. So we eventually created some simple tooling using AWS Lambda, CloudFormation, and Athena to parse the logs coming out of AWS Organizations and feed them into real-time Looker dashboards.
Now we could see some really interesting metrics - like hourly spending on EC2, total mix of instance reservations versus on-demand purchases, and total costs by service and organizational unit. This lets us identify and track targets for optimization.
For example, the graph above shows the effect of moving our lab accounts out of the AWS Organization that manages the Linux Academy platform -- note how the spiky traffic has been replaced with a predictable load that closely tracks with instance reservations.
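The roll-ups behind metrics like these (hourly EC2 spend, reservation versus on-demand mix, totals by service and organizational unit) can be sketched in a few lines of Python. This is a simplified illustration, not the Lambda / Athena pipeline itself: the record layout, field names, and costs below are all assumptions, and real billing data carries many more columns.

```python
from collections import defaultdict

# Hypothetical, simplified cost line items (field names are assumptions).
line_items = [
    {"hour": "00:00", "ou": "platform", "service": "EC2", "purchase": "reserved", "cost": 30.0},
    {"hour": "00:00", "ou": "labs", "service": "EC2", "purchase": "on_demand", "cost": 50.0},
    {"hour": "00:00", "ou": "platform", "service": "RDS", "purchase": "on_demand", "cost": 20.0},
    {"hour": "01:00", "ou": "platform", "service": "EC2", "purchase": "reserved", "cost": 30.0},
    {"hour": "01:00", "ou": "labs", "service": "EC2", "purchase": "on_demand", "cost": 10.0},
    {"hour": "01:00", "ou": "platform", "service": "RDS", "purchase": "on_demand", "cost": 20.0},
]


def total_by(items, *keys):
    """Sum cost grouped by the given fields, e.g. total_by(items, "ou", "service")."""
    totals = defaultdict(float)
    for item in items:
        totals[tuple(item[k] for k in keys)] += item["cost"]
    return dict(totals)


def purchase_mix(items, service="EC2"):
    """Fraction of one service's cost by purchase option (reserved vs on-demand)."""
    by_option = defaultdict(float)
    for item in items:
        if item["service"] == service:
            by_option[item["purchase"]] += item["cost"]
    grand_total = sum(by_option.values())
    if grand_total == 0:
        return {}
    return {option: cost / grand_total for option, cost in by_option.items()}


# Total cost by organizational unit and service.
costs_by_ou_and_service = total_by(line_items, "ou", "service")

# Hourly EC2 spend across all organizational units.
ec2_items = [item for item in line_items if item["service"] == "EC2"]
hourly_ec2 = total_by(ec2_items, "hour")

# Reserved vs on-demand share of EC2 cost.
ec2_mix = purchase_mix(line_items)

print(costs_by_ou_and_service)
print(hourly_ec2)
print(ec2_mix)
```

In this toy data, the spiky on-demand cost sits entirely in the hypothetical `labs` unit, which is the kind of separation that makes a platform's steady, reservable load visible.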
Example Optimization: Savings Plans and Reservations
Once we got better visibility on our costs, the stark realization hit us that EC2 spend was now about ⅔ of the combined ACG / LA bill … and nearly all of that spend was on-demand.
As mentioned before, we have some interesting optimization challenges because we are a training platform. We specifically encourage our learners to create cloud resources on our platform -- and we cover the bill.
Check out the graph below -- you can see that our learners are spinning up all different sizes of EC2 instances in regions all over the world, following guidance from our Hands-On Labs. If we wanted you to use some new, cheaper type of instance, we couldn’t just change a configuration parameter -- we’d actually have to go update our course content to give you different instructions.
For this reason, we quickly learned to love AWS Compute Savings Plans -- they apply discounted rates to EC2 usage regardless of instance family, size, OS, or region, in exchange for a committed level of hourly spend. Letting AWS automate the savings for us is a huge win.
Using a combination of traditional reserved instances to cover our platform costs and savings plans for the resources our learners create, we’ve been able to get our mix of reserved instances and savings plans up to about 80% of our total EC2 spend -- a KPI that should reduce our EC2 bill by about 30% over the next 12 months with no impact to our users.
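As a rough sanity check on how those two figures relate, the arithmetic can be sketched in Python. This is a back-of-the-envelope model, not our actual forecast: it assumes the 80% mix is measured against on-demand-equivalent spend and that a single average discount applies to the covered share. The function names are hypothetical, and the discount rates are derived or illustrative rather than published figures.

```python
def blended_savings(coverage: float, discount: float) -> float:
    """Fractional bill reduction versus paying on-demand rates for everything.

    coverage: share of on-demand-equivalent spend under a commitment (0..1)
    discount: average discount applied to that covered share (0..1)
    """
    return coverage * discount


def implied_discount(coverage: float, savings: float) -> float:
    """Average discount the covered share would need to hit a savings target."""
    return savings / coverage


# With ~80% coverage and a ~30% expected reduction (figures from the text),
# the implied average discount on the covered spend is:
needed = implied_discount(coverage=0.80, savings=0.30)
print(f"Implied average discount: {needed:.1%}")  # 37.5%

# What-if: the same coverage at a hypothetical 25% average discount.
what_if = blended_savings(coverage=0.80, discount=0.25)
print(f"Blended savings at a 25% discount: {what_if:.0%}")  # 20%
```

Under these assumptions, the overall reduction is simply coverage times average discount, so raising either one moves the bill.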
(Oh, and about that “serverless startup” thing … no, we don’t have plans to move all this EC2 spend to serverless anytime soon. When we look at developer time and effort to re-architect, plus training time, it’s simply not in the “quick win” bucket. But we do work to share knowledge across the team and give options to apply serverless optimizations when feasible.)
Continuous cost optimization
We’ve got lots more optimization to do here. We’re now looking at rightsizing RDS instances, our second-biggest driver of cost (turns out that the “one database per microservice” rule has some cost optimization drawbacks). We’re exploring some strategic partner options as well. And we’ll be putting more cost controls into our software development lifecycle.
We also look forward to further consolidating our reporting -- right now we track cost separately for the A Cloud Guru and Linux Academy organizations, and everybody is anticipating the great day when those two platforms (and AWS bills) become one.
But because of the financial management metrics, tooling, and monitoring we’ve implemented, we feel confident that we can optimize these areas as they surface -- while improving the overall agility of our engineering teams and the experience of our learners.