Why care about service levels?
Is there really any value in setting service level indicators and objectives?
The conversation about service levels quickly jumps into:
SLI (the indicator) defines how to measure reliability meaningfully
SLO (the objective) sets reasonable expectations for reliability
SLA (the legal contract) ties the expectation to legal penalties and incentives
But let’s step back and ask: why are we doing this? Why are service levels the right approach to reliability and risk in general?
More specifically: what is in it for you, as a service provider or consumer?
Measuring reliability
There are as many ways to measure reliability as there are people who want to measure it:
How often does a system fail?
What’s the uptime (is that the technical term)?
How fast is the system?
How long does it take to recover?
How predictably does the system behave?
How often do we have to fire someone for breaking the system?
etc.
All of these are valid ways to measure reliability (and there are technical terms for them that we’ll discuss in upcoming articles) but without a common language, it’s hard to communicate expectations between teams that own dependent systems.
Turns out there’s a simple way to measure reliability that works across a wide range of systems. It basically works like this:
Decide what a “successful event” means and find a metric to measure it
Decide on a period of time during which failures may happen
Try to find out the absolute worst amount of failure you can get away with
Can’t get simpler than that! And that’s what service levels are about. It’s a simple language to measure and communicate reliability.
It’s a simple and elegant way to normalize reliability metrics across a wide range of systems, from front-end applications to backends, databases, load balancers, serverless functions, queues, etc.
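As a minimal sketch of that three-step recipe (the metric and all numbers below are made up for illustration), the arithmetic is just a ratio over a window compared against a target:

```python
# Minimal sketch: count "good" events over a window and compare against the objective.
# The numbers and the definition of a "good" event are hypothetical.
good_events = 999_532     # e.g. responses that were not server errors
total_events = 1_000_000  # everything served during the window (say, 30 days)

sli = good_events / total_events  # the indicator: measured reliability
slo = 0.999                       # the objective: the reliability we promised

print(f"SLI = {sli:.4%}, SLO = {slo:.2%}, objective met: {sli >= slo}")
```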
The most common service level indicators are:
Availability with at least 4 different definitions depending on how deep you want to go
Latency and its various forms like TTFB (Time to First Byte), LCP (Largest Contentful Paint), good old response time, etc. (see the sketch after this list)
Durability & correctness for databases
etc.
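For instance, a latency indicator is usually framed the same way: a request counts as “good” if it completes within a threshold agreed with the consumers. The data and threshold below are hypothetical:

```python
# Hypothetical latencies in milliseconds; in practice these come from your metrics backend.
latencies_ms = [87, 120, 95, 340, 1100, 78, 210, 99, 64, 450]
threshold_ms = 300  # agreed with consumers as "fast enough"

good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
latency_sli = good / len(latencies_ms)
print(f"{latency_sli:.0%} of requests were served within {threshold_ms} ms")  # 70%
```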
But why do we need to communicate reliability in the first place?
Communicate Reliability
Everyone wants a system that is 100% reliable. But unless you’re NASA launching JWST (the James Webb Space Telescope), you probably can’t afford the cost of such high reliability.
The truth is: reliability has a cost!
The SLO is often expressed in numbers like 99%, 99.9%, 99.99%, …
For every 9 we add, the tolerated downtime shrinks roughly tenfold, but the cost will also increase roughly tenfold.
You are going to need more redundancies, better error detection, better recovery, more automated tests, better tech/hardware stack and sometimes you need to rewrite the whole thing to achieve higher reliability.
All of that may also imply a higher headcount.
Beyond a certain level you cannot even afford to have humans in the loop because the time it takes to page someone and have them fix the error is higher than the acceptable time to recover from failure.
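To make the nines concrete, here’s a quick back-of-the-envelope conversion from an availability SLO to the downtime it allows per year:

```python
# Back-of-the-envelope: allowed downtime per year for a given availability SLO.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

for slo in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = (1 - slo) * MINUTES_PER_YEAR
    print(f"{slo:.3%} -> {allowed_minutes:7.1f} minutes of downtime per year")

# 99.000% ->  5259.6 minutes (~3.7 days)
# 99.900% ->   526.0 minutes (~8.8 hours)
# 99.990% ->    52.6 minutes
# 99.999% ->     5.3 minutes
```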
Let’s try a joke:
5-nines is in the realm of Highly Available (HA) systems like software that is used at hospitals or airport control towers. It’s definitely overkill for a media site or an online shop.
For you, this is a joke. For me, it’s a memory! 🙃
It reads as an obvious joke once the reliability metrics are normalized and explicit. In reality, however, the joke may very much be someone’s life story. With a common language, we can push back on unrealistic expectations.
SLO and OKR have two things in common:
They both became popular due to Google
That “O” stands for “objective”
But they are massively different. SLO is an accurate and honest system engineering objective while OKR is an ambitious wish.
And there’s a good reason for that.
SLO is about setting expectations.
Set expectations
A common theme among the teams I meet is that they have dependencies. Having been burned enough times, some of them have built tooling to monitor the basic vitals of their dependencies (for example, uptime or latency). They don’t do it out of courtesy. It’s a quick way to assess whether an alert is their fault or whether they can go back to sleep when paged in the middle of the night. 🌃
It’s a clever workaround but is there a better way? Glad you asked!
And you guessed it: service level is the answer. One of the benefits of Service Levels is to formalize expectations between teams.
SLIs specify how the consumers of a service perceive reliability. SLOs specify how much unreliability is tolerated.
In other words, if your dependencies commit to SLOs for the SLIs that you decide on together, you can hold them accountable and rest assured that they will optimize their systems to meet those objectives.
You have the same responsibility towards your consumers and users.
SLOs trickle down all the way from the end user to the cloud provider. Every team sets and commits to expectations using normalized numbers between 0 and 100 called SLOs.
And these objectives are used to shape the optimization efforts.
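As a rough illustration of that trickle-down effect (assuming independent dependencies called in series, which is a simplification), the availability a team can realistically promise is capped by the product of its dependencies’ availabilities. The names and numbers below are hypothetical:

```python
# Rough upper bound on availability when depending on independent services in series.
dependencies = {
    "load_balancer": 0.9999,
    "backend":       0.999,
    "database":      0.9995,
}

combined = 1.0
for availability in dependencies.values():
    combined *= availability

print(f"Best-case combined availability: {combined:.4%}")  # ~99.8401%
```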
User Centered
One of the biggest differences between service levels and other metrics is their focus on the consumer.
Service levels are set based on how reliability is perceived by the consumers. In an upcoming article we’ll discuss what this perception actually means, but for now it suffices to say:
If your customers don’t care about it, it isn’t worth measuring. And if you aren’t measuring it, you cannot optimize it.
When setting the service levels, it is highly recommended to have the consumer of the service in the loop. An upcoming article discusses the format of the workshop that we use to discover the right service level indicators and set meaningful objectives.
Focus the optimization
Without understanding how reliability is perceived from our consumers’ perspective, any effort to optimize it is a gamble.
We need to rewrite that Node.js service in Go or Rust to improve performance
Said an ex-colleague of mine. And he was probably right! Not only would it have been a fun project (one that would look good on the CV), it would probably also have improved performance by 7x. We’re going to pretend that the cognitive load of mastering a new language and translating one codebase to another is zero.
Can we pretend that, though? No product manager in their right mind is going to give the team six months to disappear into the refactoring cave and come out with more or less the same feature set on a shiny new tech stack.
They are going to need really good data to support such expensive bets.
Instead of going with gut feelings, intuition, and guesswork, why not talk to your consumers to figure out their definition of reliability (SLI) and how much is enough (SLO)?
The good thing about working with service levels is that it directs all the optimization efforts:
You become what you measure.
The downside is of course that it kills all those fun hobby projects that could be done at the company’s expense.
I’m fully aware of the risk of service levels being weaponized by management. However, I’d like to honestly invite my engineering peers to look at it as a way to use hard data to put an end to emotional discussions.
If you feel strongly about a change that will improve the system, bring your data to the table. First identify the needle you’re trying to move. Then connect your ideas to that measurement: rearchitecting, refactoring, changing vendors, etc.
GDD vs DDD
One of the underlying assumptions of this way of working is that data can help us make better decisions.
Instead of GDD (gut-driven decisions), we use operational data and insight to make DDD (data-driven decisions).
We often either have too much data or too little. By deliberately identifying the data points that matter most, we can make informed decisions.
When things inevitably go wrong, there’s data to learn from.
And it’s not just for your team. When you communicate your reliability metrics to your consumers, if it doesn’t meet their expectations, they have 3 ways forward:
Prepare their own architecture to be ready for when your product is unreliable, e.g. using a cache or a fallback (see the sketch after this list)
Negotiate with your product person to set a higher SLO and give the team time to optimize for it
Bubble up the lower expectation towards their consumers by being more realistic about what they can deliver
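Here’s a minimal sketch of the first option, assuming a hypothetical fetch_with_fallback helper: when the dependency misbehaves, serve possibly stale cached data instead of an error.

```python
import time

# Hypothetical illustration of option 1: degrade gracefully when a dependency fails.
_cache = {}              # key -> (value, stored_at)
CACHE_TTL_SECONDS = 300  # how stale we are willing to serve

def fetch_with_fallback(key, fetch_from_dependency):
    """Try the dependency first; fall back to a recent cached value on failure."""
    try:
        value = fetch_from_dependency(key)
        _cache[key] = (value, time.time())
        return value
    except Exception:
        cached = _cache.get(key)
        if cached and time.time() - cached[1] < CACHE_TTL_SECONDS:
            return cached[0]  # stale-but-usable data beats an error page
        raise
```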
Whatever they do, they are acting on data and facts not feelings.
It helps control the risk.
Risk control
Does your company get a panic attack every time there’s an incident? Is finger-pointing and blaming treated as insurance against future incidents? Does the blast radius of incidents grow month after month and year after year?
That’s probably because we fail to acknowledge a simple fact: complex systems fail all the time. Service levels are about acknowledging this fact and being prepared for failure.
Failures are costly. The company may lose revenue, opportunity or credibility. Developers may lose precious time, sleep or their jobs. Failures are also a way to learn. Admittedly expensive when unnecessary, but regardless: a way to learn.
Change is the number one enemy of reliability.
As the saying goes:
If it ain't broke don't fix it
Realistically, however, we want to improve our products (add features, fix bugs, optimize efficiency) and that requires change.
How do we balance change against reliability? You guessed it: service levels.
With Service Levels we try to shift the conversation from:
It should never fail (we don’t take any risk)
To:
What is the definition of risk: SLI (risk assessment)
How much risk do we tolerate: SLO (risk management)
I’ll talk more about error budgets, which are the complements of SLOs, in a future article, but it basically boils down to this:
If the system has violated or is close to violating the reliability expectations, hold on a little while. Otherwise, change as you wish.
Those error budgets are yours. You can burn them like money.
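A minimal sketch of that policy, with made-up thresholds and numbers: work out how much of the error budget has been burned in the current window and gate risky changes on what’s left.

```python
# Hypothetical error-budget check used to gate risky changes.
slo = 0.999
total_events = 10_000_000  # events observed so far in the current window
bad_events = 6_200         # failed events in the same window

error_budget = (1 - slo) * total_events    # failures we can afford: ~10,000
budget_burned = bad_events / error_budget  # fraction of the budget already spent

if budget_burned >= 1.0:
    print("Budget exhausted: freeze risky changes and focus on reliability")
elif budget_burned >= 0.8:
    print("Budget nearly gone: ship carefully")
else:
    print("Plenty of budget left: change as you wish")  # prints this (62% burned)
```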
Chubby is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later. —Marc Alvidrez, Google SRE book
Conclusion
I have too much to write about this topic that’s close to my heart, especially the more nuanced aspects of setting SLIs, SLOs, and their more serious buddy, the SLA, which we didn’t even get to talk about.
But I had to start somewhere and hopefully this post was a good start.