10x/9 Rule
When setting SLOs (service level objectives) there’s a rule of thumb that goes like this:
For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.
I call it the 10x/9 rule (read: ten-x per nine). The first time I heard it I was suspicious, but after working through the math and reflecting on my own experience, it surprisingly holds up.
Every 9 you add
Let’s first unpack the “9” part before going to “10”.
It is common for an SLO to be composed only of 9s. For example: 99%, 99.9%… It doesn't have to be that way (99.5% is a perfectly valid SLO, and so is 93%!), but it's common.
When the SLO only has 9’s, it can be abbreviated like this:
“2-nines” is another way to say 99%. Error budget = 1%
“3-nines” is 99.9%. Error budget = 0.1%
“4-nines” is 99.99%. Error budget = 0.01%
“5-nines” is 99.999%. Error budget = 0.001% (realm of highly available systems)
…
I haven't worked with any system that has more than 5-nines, but theoretically it's possible. It's just too expensive.
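To make the abbreviation concrete, here is a minimal sketch (I'm using Python purely for illustration; the function name is my own) that converts a count of nines into the corresponding SLO and error budget:

```python
def nines_to_slo(nines: int) -> tuple[float, float]:
    """Convert a count of nines into (SLO percent, error budget percent).

    Example: 3 nines -> (99.9, 0.1)
    """
    error_budget = 100 * 10 ** -nines  # each extra nine shrinks the budget by 10x
    return 100 - error_budget, error_budget

for n in range(2, 6):
    slo, budget = nines_to_slo(n)
    print(f"{n}-nines: SLO = {slo}%, error budget = {budget}%")
```

The exponent makes the pattern explicit: each additional nine divides the error budget by ten.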
Which brings us to the 10x part.
10x more reliable
So, is the system really 10x more reliable?
Let’s start simple. Suppose we have a time-based availability SLI (also known as uptime).
This is such a common SLI that there are many sites dedicated to it. Let's use one of the simpler ones: uptime.is. You can just punch in a number and see the error budget for different periods. For a 30-day month:
99% allows for 7h 12m of downtime (25,920 seconds)
99.9% allows for 43m 12s of downtime (2,592 seconds)
99.99% allows for 4m 19s of downtime (259 seconds)
99.999% allows for 26s of downtime
99.9999% allows for 2.6s of downtime
…
You get the idea. Every nine we add to the SLO allows for 10x less downtime (i.e. more reliability).
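If you'd rather not rely on a website, the same numbers are easy to reproduce. Below is a minimal sketch (again Python for illustration, assuming a 30-day month; the window is the only parameter):

```python
from datetime import timedelta

def allowed_downtime(slo_percent: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability SLO allows within the given window."""
    error_budget_fraction = 1 - slo_percent / 100
    return window * error_budget_fraction  # timedelta supports multiplication by a float

for slo in (99, 99.9, 99.99, 99.999, 99.9999):
    print(f"{slo}% -> {allowed_downtime(slo)}")
```

Swap the window for timedelta(days=365) if you want the yearly figures instead.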
Cost of reliability
What does it take to make the system more reliable? Sometimes it's just a small tweak, but often it comes at a significant cost. For example:
Refactoring: The code may need to be refactored to use a more reliable algorithm. Sometimes you hit the limit of what's possible with a programming language and may consider rewriting the critical parts of the service in another language. That doesn't happen overnight: the planning, learning, refactoring, and migration all take time and incur real development cost.
Re-architecture: The architecture may need to change to be more scalable, available, and performant, and to handle harder NFRs (non-functional requirements).
Personnel: More work usually translates to more people, and the company gets more bloated (read how Signal, a tiny company, competes with giants like Google and Meta). Maybe AI-assisted tools will compensate. But generally, it's reasonable to expect that higher reliability has a human cost as well.
Infra: The infrastructure may need more failover capacity: beefier machines, more instances, more database replicas, etc. Redundancy and more power are key, and both cost money.
Vendors: Sometimes you need to switch to more expensive, more reliable vendors. Other times you may need to establish fallbacks (for example, if the primary payment provider fails, you fall back to a second payment provider). You may need better observability tooling so you don't drive with the headlights off. Good observability tooling gets out of your way: it is faster, gathers more relevant data, integrates with more systems, and stores and processes more data, all of which adds to the cost of the observability provider. The trend continues with anything you touch, from the CDN provider to paging tools: more reliable vendors generally cost more because they're built differently.
Time to resolution: You may need to improve your response time (how fast an issue triggers an alert and, in turn, pages the on-call person). In many countries, on-call requires extra pay (in Sweden, for example, you get paid 40% more while you're on-call regardless of whether an incident happens; the pay is even higher over weekends and the days before days off, and you also get an extra paid day off). Apart from that, as the numbers above show, beyond a certain level you cannot afford to have humans in the loop at all. For example, 5-nines allows only 26 seconds of downtime in a month! In 26 seconds, you need to identify the problem, page someone in the middle of the night, wake them up, have them question their career choices, open the laptop, find the root cause, create a fix, and ship it to production. Not possible! This is the realm of high-availability systems, where you need automatic error detection and automatic error recovery.
Process: Heavier processes may need to be put in place to prevent failure: QA (quality assurance) acting as gatekeepers, the platform team acting as a babysitter, management operating at the micro level. Before you know it, this extra friction may drive your most skilled people out the door while leaving the rest struggling with politics and finger-pointing. It's hard to put a number on it, but I think we can all agree it has a productivity cost.
Productivity: What's the number one enemy of reliability? Change! Every time you change the system, it becomes much more likely to break.
Does adding a 9 really make it 10x more expensive? Unlike downtime, the math for cost is not as straightforward and depends on the type of product, technology, architecture, personnel and many other factors.
But one thing is clear: higher reliability is expensive. The cost needs to be weighed against the business objectives and justified by the margin you’re making from running a service.
As Alex Hidalgo states in his book Implementing Service Level Objectives:
Not only is being perfect all the time impossible, but trying to be so becomes incredibly expensive very quickly. The resources needed to edge ever closer to 100% reliability grow with a curve that is steeper than linear. Since achieving 100% perfection is impossible, you can spend infinite resources and never get there.
In practice, however, there are some hurdles. The biggest one is that senior leaders often think they should be driving their teams toward perfection (100% customer satisfaction, zero downtime, and so forth). In my experience, this is the biggest mental hill to get senior management over.
I use this meme when motivating teams to measure the right thing and commit to a reasonable objective:
It’s a joke for you, but for me it’s a memory 😄
Now it’s time to tell that story.
Story
A few years ago, I joined the direct-to-consumer division of a media company. The product was basically similar to Netflix, except the content was owned by the American media giant itself. Before I joined, the product had severe reliability issues. Going through the app reviews and user feedback, one theme was dominant: the users loved the content but hated the digital platform.
They had a reliability issue.
So, they did what any good American enterprise at their scale would do: throw money at the problem 💸💸💸
The solution? Hire SREs. That’s how I came in.
When I heard that the CTO wanted 5-nines, I giggled (reminder: the service could only fail for 26 seconds a month, and we're talking about a Netflix-type product, not an airport control tower, a bank, or a hospital information system). 😄
And our entire engineering org was a fraction of Netflix's. Netflix had also been in the game for a long time and had been actively working on reliability.
Then it occurred to me: we had a hard uphill battle ahead of us. But how could we lower his expectations?
How we did that is in the “Pro tips”. These posts take anywhere from a few hours to days to write. The main chunk of the article is available for free. 🔓 For those who spare a few bucks, the “pro tips” are unlocked as a token of appreciation (right now, you can get 20% off via this link). And for those who choose not to subscribe, that's fine too. I appreciate your time reading this far, and hopefully you'll share it within your circles to inspire others. 🙏 You can also follow me on LinkedIn, where I share useful tips about technical leadership and growth mindset.