

A service level indicator (SLI) is defined as the percentage of good divided by valid, that is, SLI = good / valid × 100.
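As a minimal sketch, the definition can be written as a small Python function. The name `sli` and the choice to raise an error when nothing is valid are assumptions made for illustration, not part of the definition.

```python
def sli(good: float, valid: float) -> float:
    """Return the SLI as a percentage: good divided by valid, times 100."""
    if valid <= 0:
        # Assumption: with nothing valid to measure, the SLI is undefined.
        raise ValueError("valid must be positive")
    return 100 * good / valid


print(sli(good=999, valid=1000))  # 99.9
```

What counts as "good" and "valid" (time or events) is exactly what separates the two SLI types.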
There are two types of Service Level Indicators based on the definition of what good looks like:
Time based: aims to maximize good time
Event based: aims to maximize good events (e.g., requests that didn’t lead to errors or were answered within a reasonable time)
This article talks about these two types of SLI with some examples and when to use which.
Time Based
Time-based service level indicators show the percentage of time that the system was behaving well.
The focus is on the time period where the system was behaving correctly.
Although the definition of SLI allows being more specific, usually the valid time is the entire measurement window. In other words, the SLI is the good time divided by the total time in the window, expressed as a percentage.
The most famous example of a time-based SLI is uptime: the number of minutes or seconds a particular endpoint has been up.
In reality the metric is not evaluated in real time, but rather aggregated over a time slot.
Example
Suppose that for a database we have decided to use the query response time as the SLI. We define a threshold (for example, one second) and measure the average query response time per minute. Every minute in which this average is longer than our one-second threshold is considered a bad minute.
Time-based SLIs are interested in the percentage of good time.
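The database example can be sketched in Python as follows. The function name `time_based_sli` and the sample numbers are hypothetical; the one-second threshold and the one-minute slots come from the example.

```python
THRESHOLD_SECONDS = 1.0  # threshold from the example


def time_based_sli(avg_latency_per_minute: list[float]) -> float:
    """Percentage of minutes whose average query latency met the threshold."""
    if not avg_latency_per_minute:
        raise ValueError("need at least one minute of data")
    good_minutes = sum(
        1 for avg in avg_latency_per_minute if avg <= THRESHOLD_SECONDS
    )
    return 100 * good_minutes / len(avg_latency_per_minute)


# Hypothetical data: ten one-minute slots, two of them too slow on average.
minute_averages = [0.3, 0.4, 1.6, 0.5, 0.2, 2.1, 0.4, 0.3, 0.6, 0.5]
print(time_based_sli(minute_averages))  # 80.0
```

Each slot is judged on its own, so the input is one aggregated number per minute rather than one number per query.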
Pros
Time-based SLIs are easier to measure
They are a bit more intuitive to understand
They are easier to translate to an error budget (e.g. downtime)
Some SLIs are by nature time-based. For example, percentiles are used to focus the optimization on outliers. Percentile is calculated for a range of values over time (e.g. 5 minutes), which fits perfectly into the concept of time slots.
It is easy to tell the status based on recent data. That is because the status of each time slot is calculated separately.
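To illustrate how directly a time-based objective turns into an error budget, here is a rough sketch that converts a target percentage into allowed downtime. The function name and the 30-day default window are assumptions for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of bad time permitted by a time-based objective over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# A 99.9% objective over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
```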
Cons
Unless the system receives a uniform load (e.g. 200 req/sec for the entire month), time and impact may not be correlated. For example, a time-based SLI treats 30 minutes of downtime under heavy demand right after a product launch as exactly as serious as 30 minutes of downtime in the middle of the night, when most users don’t use the system.
A few bad events can easily hide in an aggregation period and go under the radar. For example, the average latency in a minute can be in the “good” range even though some requests in that same minute had a latency outside it.
Time-based SLIs usually miss the notion of working hours and assume a global 24/7 service. This doesn't make sense for some products. For example, if a product is supposed to be used only during working hours in a certain time zone, it doesn’t make much sense to have on-call during nights and weekends.
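The second point, bad events hiding inside an aggregation period, is easy to reproduce with a few lines of Python. The latency values below are made up for illustration, and the one-second threshold is borrowed from the database example.

```python
from statistics import mean

THRESHOLD_SECONDS = 1.0

# Hypothetical latencies (in seconds) of ten queries within a single minute.
latencies = [0.25] * 9 + [4.0]

# The minute is judged by its average, as in the time-based example.
minute_is_good = mean(latencies) <= THRESHOLD_SECONDS
# Individually, one query was far slower than the threshold.
bad_queries = [t for t in latencies if t > THRESHOLD_SECONDS]

print(mean(latencies))   # 0.625
print(minute_is_good)    # True
print(len(bad_queries))  # 1
```

The minute counts as fully good even though one in ten queries took four seconds.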
Event Based
Event-based service level indicators show the percentage of valid events that were good.
The focus is on the number of good events divided by the total number of valid events.
Example
Let’s reuse the previous example. Suppose that for a database we have decided to use the query response time as the SLI. We define a threshold (for example, one second), and any individual query that takes longer than one second to respond is considered bad.
The diagram below shows the total number of events over time. It changes as the demand for our system fluctuates.
Event-based SLIs are interested in the percentage of good events from valid events.
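A minimal sketch of the same database example, this time judging every query individually. The function name `event_based_sli` and the sample latencies are hypothetical.

```python
THRESHOLD_SECONDS = 1.0  # threshold from the example


def event_based_sli(query_latencies: list[float]) -> float:
    """Percentage of individual queries answered within the threshold."""
    if not query_latencies:
        raise ValueError("need at least one query")
    good = sum(1 for t in query_latencies if t <= THRESHOLD_SECONDS)
    return 100 * good / len(query_latencies)


# Hypothetical data: nine fast queries and one slow query.
latencies = [0.25] * 9 + [4.0]
print(event_based_sli(latencies))  # 90.0
```

No time slots are involved here: the single slow query lowers the SLI no matter when it happened.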
Pros
They automatically adjust to the amount of load
They map better to impact. If 10x more requests result in errors in the same amount of time, this affects the SLI and the error budget 10x.
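The sketch below compares both SLI types for the same 30-minute outage at two different load levels. The traffic profile, the function name `slis`, and the assumption that every request fails during a down minute are all hypothetical.

```python
def slis(requests_per_minute: list[int], down: list[bool]) -> tuple[float, float]:
    """Return (time_based, event_based) SLIs in percent.

    Assumption: every request that arrives during a down minute fails.
    """
    total_minutes = len(requests_per_minute)
    total_events = sum(requests_per_minute)
    bad_minutes = sum(down)
    bad_events = sum(r for r, d in zip(requests_per_minute, down) if d)
    return (
        100 * (total_minutes - bad_minutes) / total_minutes,
        100 * (total_events - bad_events) / total_events,
    )


# Hypothetical day: 12 busy hours at 1000 req/min, then 12 quiet hours at 100 req/min.
load = [1000] * 720 + [100] * 720
outage_at_peak = [i < 30 for i in range(1440)]      # first 30 busy minutes
outage_at_night = [i >= 1410 for i in range(1440)]  # last 30 quiet minutes

for name, down in [("peak", outage_at_peak), ("night", outage_at_night)]:
    time_sli, event_sli = slis(load, down)
    print(name, round(time_sli, 2), round(event_sli, 2))
# peak 97.92 96.21
# night 97.92 99.62
```

The time-based number is identical in both cases, while the event-based number loses ten times as much at peak, matching the 10x difference in failed requests.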
Cons
Unlike time-based SLIs, it is hard to tell the status based on recent data alone (there’s no time slot). We need the data for the entire evaluation window.
It is harder to translate to error budgets.
It is more punishing for the team: a high number of bad events in a short time can practically burn the entire error budget. One could argue that these spikes in load are exactly why we’re measuring service levels in the first place.
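A rough sketch of how an event-based error budget burns. It assumes the total number of events in the evaluation window is known or estimated up front; the function name and the numbers are hypothetical.

```python
def budget_consumed_percent(
    bad_events: int, total_events: int, slo_percent: float
) -> float:
    """Share of the event-based error budget used so far, in percent."""
    # The budget is the number of events that are allowed to be bad.
    allowed_bad = total_events * (100 - slo_percent) / 100
    return 100 * bad_events / allowed_bad


# Hypothetical month: 10 million requests against a 99.9% objective,
# so about 10,000 requests may fail. A short spike in which 8,000
# requests fail uses most of that budget at once.
print(round(budget_consumed_percent(8_000, 10_000_000, 99.9)))  # 80
```

Note that the budget depends on the total for the whole window, which is why recent data alone is not enough to tell the status.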
Conclusion
Which one you pick boils down to a few questions:
Is the reliability perceived as good time or good events?
How should the error budget be formulated?
Time-based SLIs consume the error budget based on the duration of bad time.
Event-based SLIs consume the error budget based on the proportion of bad events.
Do you want easy math? Time-based is easier in calculations.
Do you have fluctuating traffic? Time-based is less punishing for the team when incidents happen during high traffic. Unfortunately, it is just as punishing when failures happen during low traffic, which works against the team.
Do you want the SLI to map to the impact? Event-based is more accurate and maps better to the impact. It considers the impact of low/high load and spikes in the load.