Definition of good in SLI
How to use the definition of "good" in the service level formula to focus the optimization?
In a recent article, we discussed the service level indicator formula, which expresses the SLI as the share of good out of valid.
Another article discussed the definition of valid. This article talks about the definition of good.
Depending on the type of SLI:
Time-based SLI: good specifies a good time slot
Event-based SLI: good specifies a good event
There are basically 4 types of declarations for good.
1. Upper bound
This is by far the most common type of declaration for good: the value of a metric is considered good if it is below an upper threshold that denotes “good enough”.
For example, if our SLI tracks latency, a sufficiently fast request might be one that completes in under 200ms, or under 1000ms. The threshold needs to be connected to something the consumer cares about and to how the consumer perceives reliability.
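As a rough sketch of an upper-bound declaration, the snippet below counts a request as good when its latency is under a threshold and divides by the number of valid requests. The function name `latency_sli`, the 200 ms default, and the sample latencies are illustrative assumptions, not values from a real system.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """Share of valid requests whose latency is below the upper bound.

    Hypothetical helper: every measured request is treated as valid here.
    """
    if not latencies_ms:
        return None  # no valid events, so the SLI is undefined
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Made-up sample: 3 of the 5 requests finish in under 200 ms.
print(latency_sli([120, 180, 250, 90, 1000]))  # 0.6
```

Raising `threshold_ms` to 1000 would count one more request as good, which is why the threshold should follow what the consumer perceives as fast enough.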
2. Lower bound
Conceptually, this one mirrors the upper bound: the value of a metric is considered good if it is above a lower threshold that denotes “good enough”.
For example, for an expensive worker that consumes a queue (think Midjourney prompts on an expensive GPU), the utilization of those machines should be high.
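A minimal sketch of a lower-bound declaration for a time-based SLI follows. The lower bound applies to the metric value in each time slot (utilization here), not to the SLI ratio itself. The name `utilization_sli`, the 0.8 threshold, and the per-slot samples are assumptions made up for illustration.

```python
def utilization_sli(slot_utilization, lower_bound=0.8):
    """Share of time slots whose utilization is above the lower bound.

    Hypothetical helper: each entry is the average utilization (0.0 to 1.0)
    of one time slot, and every slot is treated as valid.
    """
    if not slot_utilization:
        return None  # no valid time slots, so the SLI is undefined
    good = sum(1 for u in slot_utilization if u > lower_bound)
    return good / len(slot_utilization)


# Made-up sample: 3 of the 4 slots are above 80% utilization.
print(utilization_sli([0.95, 0.9, 0.4, 0.85]))  # 0.75
```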
3. Range bound
A combination of the upper bound and lower bound. If the metric value is within a range, it’s considered good.
For example, if your service level depends on taking an action at a specific time, you may define a tolerance around that exact time. Systems that emit certain events at a pre-configured timestamp are of this nature. The system may need to emit an event every hour with a tolerance of a few seconds before or after. If the event is emitted outside that time window, it’s considered a failure.
As another example, if you have a system that needs a certain temperature range to work optimally, you may choose both upper and lower thresholds: good is the 30s time slices where the system temperature was between -10℃ and +43℃.
Having both thresholds is rather uncommon.
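The scheduled-event example can be sketched as a range check around the target time. The 5-second tolerance, the function name `is_good_emission`, and the timestamps are hypothetical; the text above only says "a few seconds".

```python
from datetime import datetime, timedelta


def is_good_emission(scheduled, actual, tolerance=timedelta(seconds=5)):
    """An emission is good if it lands within +/- tolerance of the schedule."""
    return abs(actual - scheduled) <= tolerance


scheduled = datetime(2024, 1, 1, 12, 0, 0)
emissions = [
    datetime(2024, 1, 1, 11, 59, 57),  # 3 s early: inside the window
    datetime(2024, 1, 1, 12, 0, 4),    # 4 s late: inside the window
    datetime(2024, 1, 1, 12, 0, 9),    # 9 s late: outside the window
]
good = sum(1 for actual in emissions if is_good_emission(scheduled, actual))
print(good, "good out of", len(emissions), "valid emissions")
```

The temperature example works the same way, with the comparison `-10 <= temperature <= 43` applied to each 30s time slice.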
4. No bound
In this case, good is defined as a subset of:
Time (for time-based SLIs). For example, if our goal is to improve the website uptime:
good time: minutes where the site can be pinged
valid time: all the minutes in the compliance period (e.g. a month)
Events (for event-based SLIs). For example, if our goal is to improve the product purchase flow:
good events can be the number of orders processed with a settled payment
valid events can be the number of orders placed via the website and apps
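With no bound, good is simply a filtered subset of valid. Here is a small sketch for the purchase flow example; the order records and the field names `channel` and `payment` are invented for illustration.

```python
# Hypothetical order records; field names and values are assumptions.
orders = [
    {"id": 1, "channel": "website", "payment": "settled"},
    {"id": 2, "channel": "app", "payment": "failed"},
    {"id": 3, "channel": "app", "payment": "settled"},
    {"id": 4, "channel": "phone", "payment": "settled"},  # not placed via website or apps
    {"id": 5, "channel": "website", "payment": "pending"},
]

# valid events: orders placed via the website and apps
valid = [o for o in orders if o["channel"] in {"website", "app"}]
# good events: valid orders that were processed with a settled payment
good = [o for o in valid if o["payment"] == "settled"]

sli = len(good) / len(valid) if valid else None
print(sli)  # 0.5
```

Filtering good from `valid` rather than from all orders keeps good a subset of valid, so the ratio cannot exceed 1.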
Conclusion
Depending on the type of SLI, good either specifies good events or good time periods. See this other article for more information:
The definition of good is also related to valid, so make sure to check that article as well:
> For example, an expensive worker that consumes some queue (think Midjourney prompts on an expensive GPU), the utilization on those machines should be high.
Hm, I'm not entirely sure this is a good example of an SLI; I don't think the end-customer cares about "GPU utilization"... :-P
Actually, I don't understand the "2. lower bound". If the definition of the SLI is good/total, then a lower bound means that you want _few_ good events. That doesn't make sense to me. I have always worked with upper thresholds for SLIs since I want to be above a certain ratio of "good".