Your AWS bill is a design document
Cost work is architecture review with a forcing function. This piece opens a series on cloud spend with the method behind cutting a $500K bill to $200K and holding ~$4M/yr flat.
The most honest description of your architecture isn’t the diagram on the wiki. It’s the bill. The diagram shows what someone intended eighteen months ago; the bill shows what’s actually running, in what quantity, right now, priced by the hour. Learning to read it that way is the most underrated skill in platform engineering.
I’ve done cost work at every scale that matters: a startup, where I took a $500K annual AWS bill to about $200K while uptime went from roughly 95% to over 99.9%; GoPro, where moving transcode fleets to Spot saved $350K+ a year and S3 lifecycle policies across petabytes saved hundreds of thousands more; and Postscript today, where I’ve held ~$4M of annual spend flat year over year while traffic grew. This is the first piece in an ongoing series on that work. It starts with the method, because every specific trick expires, but the method hasn’t changed in a decade.
The bill is feedback; most teams just don’t read it
Nobody designs a wasteful system on purpose. Waste arrives as the gap between decisions and their prices: an instance family chosen in a hurry, a dev environment that runs weekends, a retry loop that doubled a queue’s traffic, a debug log stream someone enabled during an incident in 2024 that now costs four figures a month. Each decision was locally reasonable. The bill is where they add up, and it’s the only place they add up, which is why cost review keeps finding architecture problems that architecture review missed.
That $500K-to-$200K rebuild is the cleanest example I have. The savings didn’t come from a cheaper contract. They came from replacing hand-built, oversized, always-on infrastructure with autoscaling groups, load balancers, and RDS sized to measured load. The bill dropped 60% and uptime went up, because the same neglect was causing both. Cost and reliability weren’t competing goals; they had a common enemy.
Attribution comes first
The first question in any cost engagement is never “what can we turn off.” It’s “who owns each dollar.” Until every major line item maps to a team, a workload, or a decision, you’re guessing, and the org will treat every proposed change as a threat because nobody can see what it touches.
Attribution is unglamorous: tagging discipline, plus occasionally unpleasant archaeology into resources whose creators have left the company. Do it anyway, for two reasons. First, savings follow ownership: a team that sees its own number moves it. Second, in my experience, an engineer who can see what their design decision costs makes a different decision next quarter. That feedback loop is the whole point.
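To make "who owns each dollar" concrete, here is a minimal sketch of the roll-up, not the tooling from any engagement described here. It assumes billing line items have already been exported into plain dictionaries; the field names (`cost`, `tags`), the `team` tag key, and the sample resources and amounts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical line items, shaped loosely like rows from a billing export.
# The field names and the "team" tag key are assumptions for illustration.
LINE_ITEMS = [
    {"resource": "i-0aaa", "cost": 1200.0, "tags": {"team": "checkout"}},
    {"resource": "vol-0bbb", "cost": 300.0, "tags": {}},
    {"resource": "db-0ccc", "cost": 2500.0, "tags": {"team": "platform"}},
    {"resource": "i-0ddd", "cost": 1000.0, "tags": {"team": "checkout"}},
]

UNATTRIBUTED = "UNATTRIBUTED"


def attribute(line_items, tag_key="team"):
    """Sum cost per owner; untagged spend lands in an explicit bucket."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, UNATTRIBUTED)
        totals[owner] += item["cost"]
    return dict(totals)


def unattributed_share(totals):
    """Fraction of total spend that nobody owns yet."""
    total = sum(totals.values())
    return totals.get(UNATTRIBUTED, 0.0) / total if total else 0.0


totals = attribute(LINE_ITEMS)
share = unattributed_share(totals)

# Largest owners first, so the report reads like the bill does.
for owner, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{owner:<14} ${cost:,.2f}")
print(f"unattributed share: {share:.1%}")
```

Keeping the untagged bucket as a named line, rather than dropping it, is the design choice that matters: the unowned share becomes a number someone can be asked to drive down.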
The order of operations
After attribution, the work sorts by effort against return, and the sequence is remarkably consistent across companies:
- Storage first. Lifecycle policies are close to free money at scale; petabytes of video at GoPro were sitting in the wrong storage class.
- Delete the idle. Unattached volumes, forgotten environments, load balancers pointing at nothing. Every account has this layer of sediment.
- Right-size. The fleet was sized for launch-day guesses or for the worst day of 2023. Measured load says otherwise.
- Spot and interruption tolerance. Large savings, real engineering: the transcode migration paid $350K+ a year precisely because we did the work to make interruption boring.
- Commitments last. Savings Plans and reserved capacity lock in your current shape, so buy them after the cleanup, not before. And negotiate the contract itself; I’ve sat on the customer side of those negotiations, and the answer to “is this the best you can do” is reliably no.
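The reason commitments come last is easiest to see as arithmetic. The sketch below uses a deliberately simplified model in which a commitment is paid in full whether or not it is used, and anything above it is billed on demand. Real Savings Plans and reservations have more moving parts; the 25% discount and the $100/hr and $60/hr usage figures are hypothetical, not numbers from any bill discussed here.

```python
def hourly_cost(usage, commitment, discount):
    """Hourly spend under a simplified commitment model.

    usage and commitment are in on-demand-equivalent dollars per hour.
    The commitment is paid at the discounted rate whether it is used or not;
    usage above the commitment is billed at the on-demand rate.
    """
    committed_price = commitment * (1 - discount)
    overflow = max(0.0, usage - commitment)
    return committed_price + overflow


DISCOUNT = 0.25          # hypothetical commitment discount
BEFORE_CLEANUP = 100.0   # hypothetical $/hr before right-sizing
AFTER_CLEANUP = 60.0     # hypothetical $/hr once the waste is removed

# Commit to the bloated shape first, then clean up: the commitment is still owed.
commit_first = hourly_cost(AFTER_CLEANUP, commitment=BEFORE_CLEANUP, discount=DISCOUNT)

# Clean up first, then commit to what is actually running.
cleanup_first = hourly_cost(AFTER_CLEANUP, commitment=AFTER_CLEANUP, discount=DISCOUNT)

# Clean up and never commit at all.
no_commitment = hourly_cost(AFTER_CLEANUP, commitment=0.0, discount=DISCOUNT)

print(commit_first, cleanup_first, no_commitment)
```

Under these assumptions, committing before the cleanup ends up costing more per hour than cleaning up and buying no commitment at all, which is what "lock in your current shape" means in practice.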
Flat is a victory condition
The Postscript number I’m proudest of isn’t a reduction. It’s ~$4M a year that stayed ~$4M while the platform grew. Cost work that ends with a one-time win and no controls regrows like a hedge; within 18 months the bill is back and the deck that justified the project is stale. Holding spend flat through growth means the controls are working: budgets that page someone and per-team visibility that makes regressions awkward, backed by a review cadence that treats the bill as an operational metric like latency or error rate.
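One way to treat the bill like latency or error rate is to run a regression check on it. The following is a rough sketch under stated assumptions, not the controls used at any company named here: it takes two hypothetical per-team monthly totals and flags teams whose spend grew past a threshold. The 10% threshold, the $100 noise floor, and the team names are all illustrative.

```python
def regressions(previous, current, threshold=0.10, floor=100.0):
    """Return (team, before, now) for teams whose spend grew more than
    `threshold` month over month. Teams under `floor` dollars are skipped
    as noise; teams with no prior spend are always flagged as new."""
    flagged = []
    for team, now in sorted(current.items()):
        if now < floor:
            continue
        before = previous.get(team, 0.0)
        if before == 0.0 or (now - before) / before > threshold:
            flagged.append((team, before, now))
    return flagged


# Hypothetical per-team totals for two consecutive months.
last_month = {"checkout": 2000.0, "platform": 2500.0, "search": 800.0}
this_month = {"checkout": 2600.0, "platform": 2550.0, "search": 790.0, "ml": 400.0}

flagged = regressions(last_month, this_month)
for team, before, now in flagged:
    print(f"review: {team} went from ${before:,.0f} to ${now:,.0f}")
```

In a real setup the flagged list would feed whatever pages or posts to the owning team; the point of the sketch is only that the comparison is per team, so a regression has an owner the moment it appears.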
That’s the standing thesis of this series: read the bill as design feedback, attribute before you optimize, take the wins in order, and build the controls that make the win permanent. Next in the series, I’ll take apart the attribution problem properly, because “just tag everything” is advice that has never once survived contact with a real AWS organization.
Questions this raises
- Where do AWS savings usually come from first?
  - In rough order of effort-to-return: storage lifecycle policies, deleting unattached and idle resources, right-sizing the instances nobody has looked at since launch, Spot for interruption-tolerant work, and purchase commitments last, once usage is stable enough to commit to. The order matters because commitments lock in whatever waste you haven’t cleaned up yet.
- Is Spot safe for production workloads?
  - For interruption-tolerant workloads, yes, and the savings are large: moving GoPro’s transcode fleets to Spot saved over $350K a year. The engineering work is making interruption genuinely tolerable (checkpointing, queue-based retry, capacity diversification), not flipping the purchase option.
- What should a cost-review engagement produce?
  - Attribution (every major line item mapped to a team or workload), an executed first round of savings rather than a slide deck of recommendations, and controls (budgets, alarms, and tagging) that keep the bill from regrowing. If a cost review ends with only a report, it didn’t end.
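The checkpointing and queue-based retry mentioned in the Spot answer above can be sketched in a few lines. This is a toy simulation under loud assumptions, not the GoPro transcode pipeline: the `Interrupted` exception stands in for a Spot reclaim, the dictionaries and lists stand in for durable storage, and every name here is hypothetical.

```python
class Interrupted(Exception):
    """Stand-in for a Spot reclaim; hypothetical, not an AWS API."""


def run_job(job_id, segments, checkpoints, output, interrupt_at=None):
    """Process segments in order, resuming from the last checkpoint.

    `checkpoints` and `output` stand in for durable stores that survive
    the loss of the instance doing the work.
    """
    start = checkpoints.get(job_id, 0)
    for index in range(start, len(segments)):
        if index == interrupt_at:
            raise Interrupted(job_id)  # simulated reclaim before this segment
        output.append(f"transcoded:{segments[index]}")  # placeholder for real work
        checkpoints[job_id] = index + 1  # record progress after each segment


def run_with_retry(job_id, segments, checkpoints, output, interruptions):
    """Queue-style retry: put the job back and run it again until it finishes.

    `interruptions` lists the segment index at which each successive
    attempt is interrupted. Returns the number of attempts used.
    """
    pending = list(interruptions)
    attempts = 0
    while True:
        attempts += 1
        interrupt_at = pending.pop(0) if pending else None
        try:
            run_job(job_id, segments, checkpoints, output, interrupt_at)
            return attempts
        except Interrupted:
            continue  # a new worker picks the job up from the checkpoint


segments = ["s0", "s1", "s2", "s3", "s4"]
checkpoints, output = {}, []
attempts = run_with_retry("job-1", segments, checkpoints, output, interruptions=[2, 4])
print(attempts, output)
```

Because progress is recorded after every segment, each interruption costs at most the segment in flight, and no segment is processed twice. That property, rather than the purchase option, is what makes an interruption boring.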
Consulting
Dealing with this on your own infrastructure?
I take contract and consulting engagements on exactly this kind of work.