Service Level Indicator is a metric. Service Level Objective sets the expectation.
Fine! But how do we calculate the SLI? What is the day-to-day experience of using service levels? More precisely, how do you relate the raw metric values (e.g. latency, error rate, cache hit rate) to a target that looks like 99.96%?
There’s a conceptual gap between the raw metric values and the SLO target.
This is where Service Level Status (SLS) comes into the picture. SLS processes the values of the service level indicator metric over time, evaluates each data point against the constraints specified in the SLO, and aggregates the results over the window specified by the SLO.
The formula for SLS is exactly that of the normalized SLI:
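In other words (a reconstruction based on the calculation described later in this article: count the good events and divide by the total number of valid events in the SLO window, expressed as a percentage):

```latex
\mathrm{SLS} = \frac{\text{good events in the SLO window}}{\text{valid events in the SLO window}} \times 100\%
```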
SLS has multiple advantages:
- It allows keeping track of the historical level of service in relation to the objective
- SLS can be plotted on a diagram where the x-axis represents time and the y-axis is a percentage (0-100). Each value represents the status of the service level in relation to the SLO, which is a horizontal line.
Here’s sample data for a latency SLI:
Here’s the SLS for the same data in relation to a 99% SLO:
This article introduces SLS: its origin story, examples, and a metaphor to think about it.
The origin story
Those of you who have read Google’s [excellent] books might be scratching your head about “SLS”. That’s because I coined this term! But don’t let that put you off. There’s a story behind it.
I’m responsible for rolling out Service Levels across a relatively large organization with hundreds of internal and external services. Aware of the obstacles to service level adoption, I take the time to meet each team, get to know their service, and define a service level indicator together, with their consumers in the room.
At one workshop, a participant asked:
— “If the SLI is the metric, what do you call the data points of that metric?”
— “The data is the SLI,” I replied.
— “But they are different things,” he said, pointing at their SLI, which read Latency, with the good events defined as response_latency < 500ms.
— “At any given point in time, we count the number of requests that were responded to in less than 500ms and divide it by the total number of requests, right?”, he continued.
— “Right.”
— “This is different from latency. Latency can be any number, from a few milliseconds to a few seconds, or even a timeout. The value we are talking about is always between 0 and 100 (inclusive), and the SLO is a straight line,” he said, as he grabbed a pen and drew something like this on the whiteboard:
Service level status is different from service level indicator or objective. It’s something in between!
- The values are always between 0 and 100 (like the SLO), regardless of the range of possible values for the SLI
- It has a value for any given point in time (like the SLI)
- Any value below the SLO line indicates a breach
- It accumulates the number of good requests over the SLO window instead of showing the real-time latency value
- The latency metric diagram changes in real time, whereas this one is less “jumpy” due to the accumulation
This metric shows the status of the Service Level. It is the Service Level Status, or SLS for short.
Calculating SLS
There’s no magic in calculating SLS. We just count the number of good events (or timeslots, in the case of time-based SLIs):
To calculate SLS at any given point in time, we look back and count the events.
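To make that concrete, here’s a minimal sketch in Python. It is not from the original post: the Sample type, the 500ms threshold, and the one-hour window are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float   # seconds since epoch
    latency_ms: float  # observed response latency in milliseconds

def sls(samples: list[Sample], now: float, window_s: float, threshold_ms: float = 500.0) -> float:
    """Service Level Status: percentage of good events in the SLO window ending at `now`."""
    # Keep only the events that fall inside the look-back window.
    in_window = [s for s in samples if now - window_s <= s.timestamp <= now]
    if not in_window:
        return 100.0  # no valid events: nothing breached the objective (a policy choice)
    # A "good" event is one that satisfied the SLI constraint (latency under the threshold).
    good = sum(1 for s in in_window if s.latency_ms < threshold_ms)
    return 100.0 * good / len(in_window)

# Four requests in the window; three met the 500ms constraint, so the SLS is 75%.
samples = [Sample(1.0, 120), Sample(2.0, 480), Sample(3.0, 950), Sample(4.0, 300)]
print(sls(samples, now=4.0, window_s=3600.0))  # 75.0, i.e. below a 99% SLO line
```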
One of the core ideas behind service levels is to put failures in the perspective of a time window. As you can see, the value of SLS is more sluggish than the raw latency values of the SLI.
Not just latency
Both the SLI and SLS are valuable, but they tell different stories. The SLI is more “raw” and helps you understand the system behaviour in real time. The SLS, on the other hand, relates that data to the commitment made to the service consumer.
Examples:
Availability
- SLI: the HTTP response code of a GET request to an endpoint
- SLS: the percentage of successful GET requests to that endpoint over the past 30 days (see the sketch after this list)
Throughput
- SLI: the number of cache hits
- SLS: the percentage of cache hits over the past 30 days
Error rate
- SLI: the number of operations that did not fail
- SLS: the percentage of operations that did not fail over the past 30 days
Note: we just used “30 days” as an example window (also known as the compliance period), but it can be any other period.
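Here’s a sketch of the availability example in the same spirit, again in Python. The success criterion (any status code below 500) and the 30-day window are assumptions for illustration, not a prescription.

```python
from datetime import datetime, timedelta, timezone

def availability_sls(events: list[tuple[datetime, int]],
                     now: datetime,
                     window: timedelta = timedelta(days=30)) -> float:
    """SLS: percentage of successful GET requests over the compliance period ending at `now`."""
    # The SLI is the HTTP response code of each request; keep only events inside the window.
    in_window = [(ts, code) for ts, code in events if now - window <= ts <= now]
    if not in_window:
        return 100.0  # no valid events in the window (a policy choice)
    # Here "successful" means any non-5xx response; adjust to your own definition of good events.
    good = sum(1 for _, code in in_window if code < 500)
    return 100.0 * good / len(in_window)

now = datetime.now(timezone.utc)
requests = [
    (now - timedelta(days=1), 200),   # good
    (now - timedelta(days=2), 503),   # bad
    (now - timedelta(days=40), 200),  # outside the 30-day window, ignored
]
print(availability_sls(requests, now))  # 50.0
```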
Pro-tips
My monetization strategy is to give away most content for free. However, these posts take anywhere from a few hours to a few days to ideate, draft, research, illustrate, edit, and publish. I pull these hours from my private time, vacation days and weekends.
Recently I cut my working hours and salary by 10% to be able to spend more time learning and sharing my experience with the public. You can support this cause by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips sections as well as my online book Reliability Engineering Mindset. Right now, you can get 20% off via this link. You can also invite your friends to gain free access.
There’s also a referral bonus program to gain free subscriptions. Thanks in advance for helping these words reach further and impact the software engineering community.