Service Level Indicator is a metric. Service Level Objective sets the expectation.
Fine! But how do we calculate the SLI? What is the day-to-day experience of using service levels? More precisely, how do you relate the raw metric values (e.g. latency, error rate, cache hit rate) to a target that looks like 99.96%?
There’s a conceptual gap between the raw metric values and the SLO target.
This is where Service Level Status (SLS) comes into the picture. SLS processes the values of the service level indicator metric over time, evaluates each data point against the constraints specified in the SLO, and aggregates the results over the window specified by the SLO.
The formula for SLS is exactly that of the normalized SLI:
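In other words (a reconstruction based on the calculation described later in this article: count the good events and divide by the total number of valid events in the SLO window, expressed as a percentage):

```latex
\mathrm{SLS} = \frac{\text{good events in the SLO window}}{\text{valid events in the SLO window}} \times 100\%
```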
SLS has multiple advantages:
- It allows keeping track of the historical level of service in relation to the objective
- SLS can be plotted on a diagram where the x-axis represents time and the y-axis is a percentage (0-100). Each value represents the status of the service level in relation to the SLO, which is a horizontal line.
Here’s sample data for a latency SLI:
Here’s the SLS for the same data in relation to a 99% SLO:
This article introduces SLS: its origin story, examples, and a metaphor to think about it.
The origin story
Those of you who have read Google’s [excellent] books might be scratching your head about “SLS”. That’s because I coined this term! But don’t let that put you off. There’s a story behind it.
I’m responsible for rolling out Service Levels across a relatively large organization with hundreds of internal and external services. Aware of the obstacles to service level adoption, I take the time to meet each team, get to know their service, and define a service level indicator together, with their consumers in the room.
At one workshop, a participant asked:
— “If the SLI is the metric, what do you call the data points of that metric?”
— “The data is the SLI,” I replied.
— “But they are different things,” he said, pointing at their SLI, which read Latency, with the good events defined as response_latency < 500ms.
— “At any given point in time, we count the number of requests that were responded to in less than 500ms and divide it by the total number of requests, right?”, he continued.
— “Right.”
— “This is different from latency. Latency can be any number, from a few milliseconds to a few seconds, or even a timeout. The value we are talking about is always between 0 and 100 (inclusive), and the SLO is a straight line,” he said, as he grabbed a pen and drew something like this on the whiteboard:
Service level status is different from service level indicator or objective. It’s something in between!
- The values are always between 0 and 100 (like the SLO), regardless of the range of possible values for the SLI
- It has a value for any given point in time (like the SLI)
- Any value below the SLO line indicates a breach
- It accumulates the number of good requests over the SLO window instead of showing the real-time latency value
- The latency metric diagram changes in real time, whereas this one is less “jumpy” due to the accumulation
This metric shows the status of the Service Level. It is the Service Level Status, or SLS for short.
Calculating SLS
There’s no magic in calculating SLS. We just count the number of good events (or timeslots, in the case of time-based SLIs):
To calculate SLS at any given point in time, we look back and count the events.
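To make that concrete, here’s a minimal sketch in Python. It is not from the original post: the Sample type, the 500ms threshold, and the one-hour window are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float   # seconds since epoch
    latency_ms: float  # observed response latency in milliseconds

def sls(samples: list[Sample], now: float, window_s: float, threshold_ms: float = 500.0) -> float:
    """Service Level Status: percentage of good events in the SLO window ending at `now`."""
    # Keep only the events that fall inside the look-back window.
    in_window = [s for s in samples if now - window_s <= s.timestamp <= now]
    if not in_window:
        return 100.0  # no valid events: nothing breached the objective (a policy choice)
    # A "good" event is one that satisfied the SLI constraint (latency under the threshold).
    good = sum(1 for s in in_window if s.latency_ms < threshold_ms)
    return 100.0 * good / len(in_window)

# Four requests in the window; three met the 500ms constraint, so the SLS is 75%.
samples = [Sample(1.0, 120), Sample(2.0, 480), Sample(3.0, 950), Sample(4.0, 300)]
print(sls(samples, now=4.0, window_s=3600.0))  # 75.0, i.e. below a 99% SLO line
```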
One of the core ideas behind service levels is to put failures in the perspective of a time window. As you can see, the value of SLS is more sluggish than the raw latency values of the SLI.
Not just latency
Both the SLI and SLS are valuable, but they tell different stories. The SLI is more “raw” and helps you understand the system behaviour in real time. The SLS, on the other hand, relates that data to the commitment made to the service consumer.
Examples:
Availability
- SLI: the HTTP response code of a GET request to an endpoint
- SLS: the percentage of successful GET requests to that endpoint over the past 30 days (see the sketch after this list)
Throughput
- SLI: the number of cache hits
- SLS: the percentage of cache hits over the past 30 days
Error rate
- SLI: the number of operations that did not fail
- SLS: the percentage of operations that did not fail over the past 30 days
Note: we just used “30 days” as an example window (also known as the compliance period), but it can be any other period.
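Here’s a sketch of the availability example in the same spirit, again in Python. The success criterion (any status code below 500) and the 30-day window are assumptions for illustration, not a prescription.

```python
from datetime import datetime, timedelta, timezone

def availability_sls(events: list[tuple[datetime, int]],
                     now: datetime,
                     window: timedelta = timedelta(days=30)) -> float:
    """SLS: percentage of successful GET requests over the compliance period ending at `now`."""
    # The SLI is the HTTP response code of each request; keep only events inside the window.
    in_window = [(ts, code) for ts, code in events if now - window <= ts <= now]
    if not in_window:
        return 100.0  # no valid events in the window (a policy choice)
    # Here "successful" means any non-5xx response; adjust to your own definition of good events.
    good = sum(1 for _, code in in_window if code < 500)
    return 100.0 * good / len(in_window)

now = datetime.now(timezone.utc)
requests = [
    (now - timedelta(days=1), 200),   # good
    (now - timedelta(days=2), 503),   # bad
    (now - timedelta(days=40), 200),  # outside the 30-day window, ignored
]
print(availability_sls(requests, now))  # 50.0
```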
Pro-tips
My monetization strategy is to give away most content for free. However, these posts take anywhere from a few hours to a few days to ideate, draft, research, illustrate, edit, and publish. I pull these hours from my private time, vacation days and weekends.
Recently I cut my working hours and salary by 10% to be able to spend more time learning and sharing my experience with the public. You can support this cause by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips sections as well as my online book Reliability Engineering Mindset. Right now, you can get 20% off via this link. You can also invite your friends to gain free access.
There’s also a referral bonus program to gain free subscriptions. Thanks in advance for helping these words reach further and impact the software engineering community.