SLO
Introduction to Service Level Objectives and their relationship with error budgets
A Service Level Objective (SLO) sets the reliability expectation for the service based on how the service consumer perceives its reliability (the SLI).
There are three aspects to defining an SLO:
The value: usually a number like 99.9%
The compliance period (also known as the "window"): usually 30 days. This is the time window used to count events or time-slices in the past to calculate the SLS (Service Level Status).
The boundaries (optional): for SLIs that use thresholds for what good events look like (e.g., if a latency below 800ms is considered good, the number 800ms is part of the threshold).
Note: all three are used for multi-tiered SLOs which tie to the same SLI.
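To make these three aspects concrete, here is a minimal sketch of how an SLO definition could be represented in code. This is Python, and the field names are my own invention, not taken from any particular SLO tool:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLO:
    """Hypothetical container for the three aspects of an SLO."""
    value: float                          # the target, e.g. 0.999 for 99.9%
    window_days: int = 30                 # the compliance period ("window")
    threshold_ms: Optional[float] = None  # optional boundary for threshold-based SLIs

# "99.9% of requests over a rolling 30-day window are faster than 800 ms"
latency_slo = SLO(value=0.999, window_days=30, threshold_ms=800)
```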
SLO heavily impacts the non-functional requirements (NFR) for how the technical solution is designed, built, operated, secured, maintained, and budgeted.
While SLO gets all the publicity, the error budget is the real hero of reliability engineering. If you remember from why bother with service levels, we are trying to shift the mindset from “systems should never fail” to understanding what failure is (SLI) and how much of it can be tolerated (error budget).
The error budget is the maximum percentage of failure that the service is allowed to have in a given period. It is directly tied to the SLO. In fact, it complements the SLO: error budget = 100% - SLO.
For example, if the SLO is 99%, the error budget is 1%. If the SLO is 99.6%, the error budget is 0.4%.
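To make the arithmetic explicit, here is a tiny sketch (both values expressed as fractions; the function name is mine):

```python
def error_budget(slo: float) -> float:
    """The error budget is simply the complement of the SLO."""
    return 1.0 - slo

print(f"{error_budget(0.99):.2%}")   # 1.00%
print(f"{error_budget(0.996):.2%}")  # 0.40%
```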
Example
Let’s start simple by defining an SLO for an API (service) which is consumed by two teams: a mobile team and a web team (service consumers).
The mobile app and web are the interface towards users. The business makes money by charging the users a monthly fee (e.g., something like Netflix).
Note: since money is involved, there may be an SLA too, but it’s out of the scope of this article.
The mobile and web team want to hold the team behind the API (service provider) accountable for keeping the API available.
The service consumers in this example perceive the reliability of the API based on its uptime (i.e., a time-based SLI).
Together with the API team, they agree to use a probe to check the API availability every minute.
Note: there are many ways to measure availability. Pinging an endpoint every minute doesn't give the best signal, but it's easy to start with.
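For illustration only, such a probe could be as simple as the sketch below. The URL is hypothetical, and in practice you would use a synthetic-monitoring tool and store the results somewhere durable instead of printing them:

```python
import time

import requests  # third-party: pip install requests

HEALTH_URL = "https://api.example.com/health"  # hypothetical endpoint

def probe_once(timeout_s: float = 5.0) -> bool:
    """Return True if the API responds with a non-error status (<400) within the timeout."""
    try:
        return requests.get(HEALTH_URL, timeout=timeout_s).ok
    except requests.RequestException:
        return False

while True:
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} up={probe_once()}")
    time.sleep(60)  # one probe per minute, as agreed
```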
They decided on an SLO window of 30 days because the subscribers are charged on a monthly basis and the UX study has some data about how much failure the customers can put up with before they seriously consider cancelling their subscription.
The maximum downtime the web team can tolerate is 10 minutes per month. But the mobile app team can tolerate a maximum of 2 hours of downtime per month due to on-device caching in their implementation.
Here’s what we know:
Service = API
Service Provider = the team behind the API
Service Consumers = mobile app and web team
Service Level Indicator = Availability. Ping the /health endpoint every minute (time-based)
Compliance Period (SLO window) = 30 days = 43,200 minutes (you can use Google for these conversions)
Rolling window: at any given time, we look at the past 43,200 minutes
Error budget = min(10 minutes, 120 minutes) = 10 minutes (the SLO needs to satisfy the requirement of the most demanding consumer; see the sketch below)
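Putting the numbers from that list together, a quick sketch (the variable names are mine):

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30 days = 43,200 minutes

web_tolerance_min = 10      # max downtime the web team can tolerate per window
mobile_tolerance_min = 120  # max downtime the mobile app team can tolerate per window

# The SLO has to satisfy the most demanding consumer, so the error budget
# is the smaller of the two tolerances.
error_budget_minutes = min(web_tolerance_min, mobile_tolerance_min)
print(WINDOW_MINUTES, error_budget_minutes)  # 43200 10
```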
The API team checks the historical data for their availability metrics. We call that the Service Level Status (SLS) to distinguish it from the SLI, which is a metric, and the SLO, which is an objective.
The API team checked their historical uptime and learned that in the past 90 days (129,600 minutes) they had a downtime of 85 minutes. That's the level they're comfortable committing to (the status quo): (129,600 - 85) / 129,600 ≈ 99.93% availability.
This is below the level that the web team can tolerate (10 minutes of downtime in 30 days, i.e., at least 99.97% availability). But it is above the level that the mobile team can tolerate (120 minutes of downtime in 30 days, i.e., at least 99.72% availability).
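Here is roughly how those figures are derived (a minimal sketch; the helper function name is mine):

```python
WINDOW_MINUTES = 30 * 24 * 60   # 30 days
PAST_90_DAYS = 90 * 24 * 60     # 90 days = 129,600 minutes

def availability(downtime_min: float, window_min: float) -> float:
    """Fraction of the window during which the service was up."""
    return (window_min - downtime_min) / window_min

status_quo   = availability(85, PAST_90_DAYS)     # API team's historical uptime
web_needs    = availability(10, WINDOW_MINUTES)   # web team's tolerance
mobile_needs = availability(120, WINDOW_MINUTES)  # mobile app team's tolerance

print(f"{status_quo:.3%}  {web_needs:.3%}  {mobile_needs:.3%}")
# 99.934%  99.977%  99.722%
```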
So, we have 99.93%, 99.97%, and 99.72%. Which one should we pick?
Of course, the API team prefers the smallest SLO (99.72%) because that gives them a higher error budget.
But that SLO is below the level that the web team needs with its current implementation. What can they do?
As we discussed in the ”lagom” SLO, the web team has three options:
Be prepared for outage
Share the risk with their consumers (end users)
Negotiate with the API team
The negotiation is tricky because the service provider and service consumer have a conflict of interest.
The provider wants to maximize the error budget and may use the rule of 10x/9 in their defense.
The consumer wants to maximize the SLO and uses the paying customer experience as an argument.
Finding an impactful SLI and a reasonable SLO can be very tricky. As a Sr Staff Engineer responsible for a large organization (100+ teams), I have covered more than half of those teams with Service Level Workshops. I'll be sharing the steps and experience in an upcoming post.
My monetization strategy is to give away most content for free. However, these posts take anywhere from a few hours to days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. Recently I went down in working hours and salary by 10% to be able to spend more time on the newsletter. You can support me by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips section as well as my book Reliability Engineering Mindset. Right now, you can get 20% off via this link.
Pro-Tips