SLO
Introduction to Service Level Objectives and their relationship with error budgets
A Service Level Objective (SLO) sets the reliability expectation for the service based on how the service consumer perceives reliability (the SLI).
There are three aspects that define an SLO:
The value: usually a number like 99.9%
The compliance period (also known as “window”): usually 30 days. This is the time window which is used to count events or time-slices in the past to calculate the SLS (Service Level Status).
The boundaries (optional): for SLIs that use thresholds for what good events look like (e.g., if a latency below 800ms is considered good, the number 800ms is part of the threshold).
Note: all three are used for multi-tiered SLOs which tie to the same SLI.
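To make these aspects concrete, here is a minimal sketch of how an SLO definition could be captured in code (the names and structure are my own illustration, not a standard API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLO:
    """The three aspects of an SLO: value, compliance period, optional boundary."""
    value_percent: float                   # the objective, e.g. 99.9 (%)
    window_days: int                       # the compliance period, e.g. 30
    threshold_ms: Optional[float] = None   # boundary for threshold-based SLIs

# A latency SLO: 99.9% of requests complete below 800ms over a 30-day window
latency_slo = SLO(value_percent=99.9, window_days=30, threshold_ms=800)

# A time-based availability SLO has no threshold boundary
availability_slo = SLO(value_percent=99.9, window_days=30)
```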
SLO heavily impacts the non-functional requirements (NFR) for how the technical solution is designed, built, operated, secured, maintained, and budgeted.
While SLO gets all the publicity, the error budget is the real hero of reliability engineering. If you remember from why bother with service levels, we are trying to shift the mindset from “systems should never fail” to understanding what failure is (SLI) and how much of it can be tolerated (error budget).
Error budget is the maximum percentage of failure that the service is allowed to have in a given period. It is directly tied to the SLO. In fact, it complements the SLO:
Error budget = 100% - SLO
For example, if the SLO is 99%, the error budget is 1%. If the SLO is 99.6%, the error budget is 0.4%.
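To make the arithmetic explicit, here is a quick sketch (my own illustration) of the complement relationship and the downtime it allows per window:

```python
def error_budget_percent(slo_percent: float) -> float:
    """The error budget is the complement of the SLO."""
    return 100.0 - slo_percent

def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime per compliance window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0

print(round(error_budget_percent(99.0), 2))      # 1.0 (%)
print(round(error_budget_percent(99.6), 2))      # 0.4 (%)
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2 minutes per 30-day window
```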
Example
Let’s start simple by defining an SLO for an API (service) which is consumed by two teams: a mobile team and a web team (service consumers).
The mobile app and web are the interface towards users. The business makes money by charging the users a monthly fee (e.g., something like Netflix).
Note: since money is involved, there may be an SLA too, but it’s out of the scope of this article.
The mobile and web teams want to hold the team behind the API (service provider) accountable for keeping the API available.
The service consumers in this example perceive the reliability of the API based on its uptime (i.e., a time-based SLI).
Together with the API team, they agree to use a probe to check the API’s availability every minute.
Note: there are many ways to measure availability. Pinging an endpoint every minute doesn’t give the best signal, but it’s easy to start with.
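For illustration, a bare-bones version of such a minute-by-minute probe could look like this (the URL is hypothetical; in practice you would use an existing monitoring or synthetic-check tool):

```python
import time
import urllib.request

API_HEALTH_URL = "https://api.example.com/health"  # hypothetical endpoint

def probe_once(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except Exception:
        return False

# One probe per minute; each result is one good/bad time-slice for the SLI.
while True:
    is_up = probe_once(API_HEALTH_URL)
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} up={is_up}")
    time.sleep(60)
```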
They decided on an SLO window of 30 days because the subscribers are charged on a monthly basis and the UX study has some data about how much failure the customers can put up with before they seriously consider cancelling their subscription.
The maximum downtime the web team can tolerate is 10 minutes per month.
But the mobile app team can tolerate a maximum of 2 hours of downtime per month due to on-device caching in their implementation.
Here’s what we know:
Service = API
Service Provider = the team behind the API
Service Consumers = the mobile app and web teams
Service Level Indicator = Availability. Ping the /health endpoint every minute (time-based)
Compliance Period (SLO window) = 30 days = 43,200 minutes (you can use Google for these conversions)
Rolling window: at any given time, we look at the past 43,200 minutes.
Error budget = min(10m, 120m) = 10 minutes (the SLO needs to satisfy the requirement of the most demanding consumer).
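To tie those numbers together, here is a small sketch (the helper names are my own) that turns each consumer’s downtime tolerance into the SLO it implies:

```python
import math

WINDOW_MINUTES = 30 * 24 * 60  # 30 days = 43,200 minutes

def required_slo_percent(max_downtime_minutes: float,
                         window_minutes: int = WINDOW_MINUTES) -> float:
    """The minimum SLO that keeps downtime within a consumer's tolerance."""
    return 100.0 * (1 - max_downtime_minutes / window_minutes)

def floor2(x: float) -> float:
    """Round down to two decimals (never overstate reliability)."""
    return math.floor(x * 100) / 100

print(floor2(required_slo_percent(10)))   # 99.97 -> the web team (10 minutes)
print(floor2(required_slo_percent(120)))  # 99.72 -> the mobile team (120 minutes)

# The SLO has to satisfy the most demanding consumer,
# so the error budget is min(10, 120) = 10 minutes per window.
error_budget_minutes = min(10, 120)
```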
The API team checks the historical data for their availability metrics. We call that the Service Level Status (SLS) to distinguish it from the SLI, which is a metric, and the SLO, which is an objective.
The API team checked their historical uptime and learned that in the past 90 days (129,600 minutes), they had a downtime of 85 minutes. That’s what they’re comfortable to commit to (the status quo):
(129,600 - 85) / 129,600 ≈ 99.93% uptime
This is below the level that the web team can tolerate (10 minutes of downtime in 30 days):
(43,200 - 10) / 43,200 ≈ 99.97% uptime
But it is above the level that the mobile team can tolerate (120 minutes of downtime in 30 days):
(43,200 - 120) / 43,200 ≈ 99.72% uptime
So, we have 99.93%, 99.97%, and 99.72%. Which one should we pick?
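And the status-quo side of the comparison, as a sketch using the same numbers:

```python
# Historical data: 85 minutes of downtime over the past 90 days
PAST_MINUTES = 90 * 24 * 60   # 129,600 minutes
DOWNTIME_MINUTES = 85

sls_percent = 100.0 * (PAST_MINUTES - DOWNTIME_MINUTES) / PAST_MINUTES
print(round(sls_percent, 2))  # 99.93 -> the level the API team can commit to

# 99.93% clears the mobile team's 99.72% but falls short of the web team's 99.97%.
```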
Of course, the API team prefers the smallest SLO (99.72%) because that gives them a higher error budget.
But that SLO is below the level that the web team needs with its current implementation. What can they do?
As we discussed in the “lagom” SLO, the web team has three options:
Be prepared for outage
Share the risk with their consumers (end users)
Negotiate with the API team
The negotiation is tricky because the service provider and service consumer have a conflict of interest.
The provider wants to maximize the error budget and may use the rule of 10x/9 in their defense.
The consumer wants to maximize the SLO and uses the paying customer experience as an argument.
Finding an impactful SLI and a reasonable SLO can be very tricky. As a Sr. Staff Engineer responsible for a large organization (100+ teams), I have covered more than half of those teams with Service Level Workshops. I’ll be sharing the steps and my experience in an upcoming post.
My monetization strategy is to give away most content for free. However, these posts take anywhere from a few hours to days to draft, edit, research, illustrate, and publish. I pull these hours from my private time, vacation days and weekends. Recently I went down in working hours and salary by 10% to be able to spend more time on the newsletter. You can support me by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips section as well as my book Reliability Engineering Mindset. Right now, you can get 20% off via this link.