Definition of good in SLI

How to use the definition of "good" in the service level formula to focus the optimization?

Alex Ewerlöf

Aug 08, 2023

In a recent article, we discussed the service level indicator formula:

\(SLI = \frac {\text{good}} {\text{valid}} \times 100\)

Another article discussed valid. This article covers the definition of good.

Depending on the type of SLI:

Time-based SLI: good specifies a good time slot
Event-based SLI: good specifies a good event
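As a minimal sketch of the formula above, the same calculation applies whether good and valid count events or time slots. The function name `sli` and the sample counts are illustrative assumptions, not taken from the article.

```python
def sli(good: int, valid: int) -> float:
    """Service level indicator as a percentage: good / valid * 100."""
    if valid <= 0:
        raise ValueError("valid must be a positive count")
    if not 0 <= good <= valid:
        raise ValueError("good must be between 0 and valid")
    # Multiply before dividing so round ratios stay exact in floating point.
    return 100 * good / valid


# Event-based: good requests out of valid requests (made-up numbers).
event_sli = sli(good=9_950, valid=10_000)

# Time-based: good minutes out of the valid minutes in a 30-day period
# (30 * 24 * 60 = 43,200 minutes; the good count is made up).
time_sli = sli(good=43_170, valid=43_200)

print(f"event-based: {event_sli:.2f}%  time-based: {time_sli:.2f}%")
```

The guard clauses reflect that good is always a subset of valid, so the ratio cannot exceed 100.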

There are four types of declarations for good.

1. Upper bound

This is by far the most common type of declaration for good: the value of a metric is considered good if it is below an upper threshold that denotes “good enough”.

For example, if our SLI targets latency, a good request is one that is sufficiently fast. The threshold could be 200ms, or it could be 1000ms. The value needs to be connected to something the consumer cares about and to how the consumer perceives reliability.
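An upper-bound declaration can be sketched as a simple predicate over each valid event. The 200ms threshold comes from the example above; the function name, the sample latencies, and the choice of a strict comparison are assumptions for illustration.

```python
UPPER_MS = 200  # assumed threshold; choose one that matches what consumers perceive


def is_good_latency(latency_ms: float, upper_ms: float = UPPER_MS) -> bool:
    # "Below" is read as strictly less than here; an inclusive bound
    # is an equally reasonable choice as long as it is applied consistently.
    return latency_ms < upper_ms


latencies_ms = [120, 180, 250, 90, 1000]  # made-up latencies of valid requests
good = sum(is_good_latency(v) for v in latencies_ms)
latency_sli = 100 * good / len(latencies_ms)

print(latency_sli)  # 60.0 (3 of the 5 requests are under 200ms)
```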

2. Lower bound

Conceptually this is the opposite of the upper bound: the value of a metric is considered good if it is above a lower threshold that denotes “good enough”.

For example, for an expensive worker that consumes a queue (think Midjourney prompts on an expensive GPU), the utilization on those machines should be high.
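A lower-bound declaration looks the same with the comparison flipped. This sketch treats utilization as a time-based SLI with one sample per time slot; the 80% threshold, the names, and the sample values are all hypothetical.

```python
LOWER_PCT = 80  # assumed threshold: a slot is good if utilization is above 80%


def is_good_utilization(utilization_pct: float, lower_pct: float = LOWER_PCT) -> bool:
    # Good means the metric is above the lower threshold.
    return utilization_pct > lower_pct


# Made-up GPU utilization samples, one per time slot.
utilization_pct = [95, 85, 40, 90]
good_slots = sum(is_good_utilization(u) for u in utilization_pct)
utilization_sli = 100 * good_slots / len(utilization_pct)

print(utilization_sli)  # 75.0 (3 of the 4 slots are above 80%)
```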

3. Range bound

A combination of the upper bound and lower bound. If the metric value is within a range, it’s considered good.

For example, if your service level depends on taking an action at a specific time, you may define a tolerance around that exact time. Systems that emit certain events at a pre-configured timestamp are of this nature. The system may need to emit an event every hour with a tolerance of a few seconds before or after. If the event is emitted outside that time window, it’s considered a failure.

As another example, if you have a system that is sensitive to temperature, you may choose both upper and lower thresholds: good is the 30s time slices where the system temperature was between -10℃ and +43℃.

Having both thresholds is rather uncommon.
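The temperature example can be sketched as a range check per time slice. The bounds come from the text above; treating both bounds as inclusive, the function name, and the readings are assumptions.

```python
LOWER_C = -10.0
UPPER_C = 43.0


def is_good_temperature(temp_c: float, lower: float = LOWER_C, upper: float = UPPER_C) -> bool:
    # Assumption: both bounds are inclusive.
    return lower <= temp_c <= upper


# Made-up readings, one per 30s time slice.
readings_c = [21.5, 44.0, -12.0, 30.0, 43.0]
good_slices = sum(is_good_temperature(t) for t in readings_c)
temperature_sli = 100 * good_slices / len(readings_c)

print(temperature_sli)  # 60.0 (44.0 and -12.0 fall outside the range)
```

The hourly event example follows the same shape: compute the offset between the actual and scheduled emission time, and check that it falls within the tolerance on either side.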

4. No bound

In this case, good is defined as a subset of:

Time (for time-based SLIs). For example, if our goal is to improve the website uptime:
- good time: minutes where the site can be pinged
- valid time: all the minutes in the compliance period (e.g. a month)
Events (for event-based SLIs). For example, if our goal is to improve the product purchase flow:
- good events can be the number of orders processed with a settled payment
- valid events can be the number of orders placed via the website and apps
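With no bound, good is selected by a condition rather than a threshold on a metric. The purchase-flow example might be sketched as follows; the `Order` type, its fields, and the sample orders are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Order:
    channel: str           # hypothetical values: "website", "app", "phone"
    payment_settled: bool


orders = [
    Order("website", True),
    Order("app", True),
    Order("app", False),     # valid, but not good: payment never settled
    Order("phone", True),    # not valid: not placed via the website or apps
    Order("website", True),
]

# valid: orders placed via the website and apps
valid_orders = [o for o in orders if o.channel in {"website", "app"}]
# good: the subset of valid orders that were processed with a settled payment
good_orders = [o for o in valid_orders if o.payment_settled]

purchase_sli = 100 * len(good_orders) / len(valid_orders)
print(purchase_sli)  # 75.0 (3 good out of 4 valid)
```

Filtering good from `valid_orders` rather than from `orders` keeps good a strict subset of valid, so the phone order affects neither the numerator nor the denominator.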

Conclusion

Depending on the type of SLI, good either specifies good events or good time periods. See this other article for more information:

Time based vs Event based SLIs

Alex Ewerlöf

August 7, 2023


The definition of good is also related to valid, so make sure to check that article as well:

SLI: Valid vs Total

Alex Ewerlöf

August 8, 2023


Jens Rantil

Sep 17, 2023

Actually, I don't understand the "2. Lower bound". If the definition of SLI is good/total, then a lower bound means that you want _few_ good events. That doesn't make sense to me. I have always worked with upper thresholds for SLIs since I want to be above a certain ratio of "good".


> For example, an expensive worker that consumes some queue (think Midjourney prompts on an expensive GPU), the utilization on those machines should be high.

Hm, I'm not entirely sure this is a good example of an SLI; I don't think the end-customer cares about "GPU utilization"... :-P
