SLI: Valid vs Total

The difference between valid and total in SLI

Aug 08, 2023

In simple terms, an SLI is the percentage of good. More accurately, it is the percentage of good events (or time) out of a valid set of events (or time):

\(SLI = \frac {\text{good}} {\text{valid}} \times 100\)

For simplicity, sometimes the word "total" is used instead of "valid" but there are differences between the two.

This article discusses how to use those differences to your advantage.
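To make the distinction concrete, here is a minimal sketch of the formula above. The request counts are invented for illustration, and the split between health checks and API calls is an assumption, not data from a real service.

```python
def sli(good: int, valid: int) -> float:
    """Return the percentage of good events among the valid ones."""
    if valid <= 0:
        raise ValueError("SLI is undefined without valid events")
    return 100 * good / valid


# Hypothetical counts for one SLO window.
health_checks = 200   # all succeed, but no user ever sees them
api_calls = 800       # the requests users actually experience
api_calls_ok = 776    # API calls that met the success criterion

# "Total" counts every request, including the health checks.
sli_total = sli(good=health_checks + api_calls_ok,
                valid=health_checks + api_calls)

# "Valid" scopes the SLI to the API calls only.
sli_valid = sli(good=api_calls_ok, valid=api_calls)

print(f"total-based SLI: {sli_total:.1f}%")
print(f"valid-based SLI: {sli_valid:.1f}%")
```

With these numbers the total-based SLI is 97.6% while the valid-based SLI is 97.0%: the always-green health checks make the service look better than the API really is.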

Why “valid”?

A service level indicator guides optimization. The definition of valid gives that optimization a scope.

There are a few reasons to use valid instead of total:

1️⃣ Focus the optimization effort

Trying to optimize everything at once is like solving a multi-variable equation or trying to hit a moving target.

Example: request latency

❌ Latency of all requests to our backend (API, static, etc.)

✔️ Latency of API calls excluding the health check
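As a rough sketch of the scoped version, the snippet below computes a latency SLI over API calls only. The `Request` shape, the `/api/health` path and the 300 ms threshold are all hypothetical; substitute whatever your own telemetry provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    path: str
    latency_ms: float


THRESHOLD_MS = 300  # hypothetical "fast enough" threshold


def is_valid(request: Request) -> bool:
    # Only API calls count, and the health check is excluded.
    return request.path.startswith("/api/") and request.path != "/api/health"


def is_good(request: Request) -> bool:
    return request.latency_ms <= THRESHOLD_MS


def latency_sli(requests: List[Request]) -> Optional[float]:
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None  # no valid events, so the SLI is undefined
    good = [r for r in valid if is_good(r)]
    return 100 * len(good) / len(valid)


requests = [
    Request("/api/orders", 120),
    Request("/api/orders", 450),       # valid but too slow
    Request("/api/health", 5),         # excluded: health check
    Request("/static/logo.png", 900),  # excluded: not an API call
    Request("/api/users", 200),
    Request("/api/users", 280),
]

result = latency_sli(requests)
```

Here four requests are valid and three of them are good, so the slow static asset and the very fast health check have no effect on the number.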

Another example: incident resolution time

❌ Time to resolve an incident

✔️ Time to resolve high priority incidents

Another example: website uptime

❌ Website uptime

✔️ Uptime, excluding cloud provider downtime and outages during planned maintenance windows

2️⃣ Clarify responsibility and control

You should only be responsible for what you control. It’s only fair.

❌ User-facing API success rate (our API may fail because of our dependencies)

✔️ User-facing API success rate, excluding errors caused by failures in our dependencies or by content that the team doesn’t control

Another example:

❌ All API calls

✔️ Authenticated API calls (we don’t want to be punished for DDoS attacks or any other unintended usage of the API surface)
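The two examples above could be combined roughly as follows. This is only a sketch: the `ApiCall` fields, and in particular the `dependency_failed` flag, are assumptions about what the service records, and treating any status below 500 as good is one possible success criterion among many.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ApiCall:
    authenticated: bool
    status: int
    dependency_failed: bool = False  # hypothetical flag set by the service


def is_valid(call: ApiCall) -> bool:
    # Scope: authenticated calls that did not fail because of a dependency.
    return call.authenticated and not call.dependency_failed


def is_good(call: ApiCall) -> bool:
    return call.status < 500


def success_rate(calls: List[ApiCall]) -> Optional[float]:
    valid = [c for c in calls if is_valid(c)]
    if not valid:
        return None
    return 100 * sum(is_good(c) for c in valid) / len(valid)


calls = [
    ApiCall(True, 200),
    ApiCall(True, 201),
    ApiCall(True, 204),
    ApiCall(True, 200),
    ApiCall(True, 500),                          # our own bug: valid, not good
    ApiCall(True, 502, dependency_failed=True),  # excluded: dependency outage
    ApiCall(False, 401),                         # excluded: unauthenticated
    ApiCall(False, 401),                         # excluded: unauthenticated
]

rate = success_rate(calls)
```

Five calls are valid and four of them are good. The dependency failure is still worth tracking somewhere else; it is just not part of this team's SLI.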

Note that just because something is out of our control, it doesn’t mean that we shouldn’t be prepared for the risk. See the following articles for example:

Failover (Alex Ewerlöf, July 12, 2023)
Fallback (Alex Ewerlöf, July 26, 2023)

We should also be very careful about how valid scopes down the team’s responsibility. For example, one team used GitHub Actions for deployment. On a rare occasion, their website outage coincided with GitHub Actions being down as well. This meant that for a few hours they had no way of fixing production until GitHub Actions was working again. Although the availability of GitHub is outside the team’s control, the choice of CI/CD was very much in their control.

Other ways that valid criteria can scope the service level optimization:

A specific endpoint
- e.g. /api
A specific type of request
- e.g. authenticated == true
- e.g. requests from Mobile Client
- e.g. HTTP GET requests
A specific resource
- e.g. article max-age < 60 sec
- e.g. static images necessary for the first meaningful paint (as opposed to videos, etc.)
A specific database
- e.g. database == orders
A specific type of user
- e.g. user-type == premium
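One way to express such criteria is as a small, reusable filter. The sketch below is an assumption about how events might be represented (plain dictionaries with made-up keys such as `method` and `user_type`), and it only handles equality checks; a range criterion such as max-age < 60 sec would need a richer predicate.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def make_valid_filter(**criteria: Any) -> Callable[[Event], bool]:
    """Build a predicate that is true when an event matches every criterion."""
    def is_valid(event: Event) -> bool:
        return all(event.get(key) == expected
                   for key, expected in criteria.items())
    return is_valid


# Hypothetical events; "ok" marks whether the event was good.
events = [
    {"endpoint": "/api", "method": "GET", "user_type": "premium", "ok": True},
    {"endpoint": "/api", "method": "GET", "user_type": "free", "ok": False},
    {"endpoint": "/api", "method": "POST", "user_type": "premium", "ok": True},
    {"endpoint": "/api", "method": "GET", "user_type": "premium", "ok": False},
]

# Scope: HTTP GET requests to /api made by premium users.
premium_reads = make_valid_filter(endpoint="/api", method="GET",
                                  user_type="premium")

valid_events = [e for e in events if premium_reads(e)]
good_events = [e for e in valid_events if e["ok"]]
scoped_sli = 100 * len(good_events) / len(valid_events)
```

Keeping the scope in one named filter makes it easy to review what the SLI covers and, just as importantly, what it leaves out.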

It might be tempting to scope the time-based SLIs by using valid to exclude:

Planned maintenance windows
Time out of working hours (for software that is used only during working hours)

But this is not recommended: end users whose usage falls inside these “expected” periods of lower service still get a poor service level, and the SLI would no longer show it.

Time-based SLIs usually use the entire SLO window as valid. In that scenario there’s no difference between valid and total.
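The effect of excluding maintenance can be seen with some simple arithmetic. The figures below are invented: a 30-day window, 90 minutes of downtime in total, 60 of which fell inside a planned maintenance window that was down for its entire duration.

```python
WINDOW_MINUTES = 30 * 24 * 60  # a 30-day SLO window


def time_sli(good_minutes: float, valid_minutes: float) -> float:
    return 100 * good_minutes / valid_minutes


# Hypothetical downtime figures for the window.
downtime_total = 90            # minutes the site was down overall
downtime_in_maintenance = 60   # of which happened during planned maintenance

# Valid is the entire window: valid and total are the same thing.
sli_whole_window = time_sli(WINDOW_MINUTES - downtime_total, WINDOW_MINUTES)

# Valid excludes the maintenance window (not recommended).
valid_minutes = WINDOW_MINUTES - downtime_in_maintenance
good_minutes = valid_minutes - (downtime_total - downtime_in_maintenance)
sli_excluding_maintenance = time_sli(good_minutes, valid_minutes)

print(f"whole window:          {sli_whole_window:.2f}%")
print(f"excluding maintenance: {sli_excluding_maintenance:.2f}%")
```

The second number is higher even though users experienced exactly the same 90 minutes of downtime, which is why the entire window is usually the better choice for valid.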

Example: News Site

The business of news is the business of speed. The company behind a media site decided that news freshness is one SLI that they want to measure and optimize. Freshness is defined as:

New: The time it takes for a new article to become available on the site. It is calculated as the difference between the “Published” timestamp in the browser and the “Published” timestamp in the CMS.
Update: The time it takes for an update to an article to become visible on the site. It is calculated as the difference between the “Last Update” timestamp in the browser and the “Last Update” timestamp in the CMS.

Sounds easy enough but there are so many caches in between: CMS, BFF, CDN, browser, etc.

[Figure: different layers of cache and what is controlled by the frontend team]

The news articles are stored in the CMS (content management system). They go to the BFF (backend for frontend) before being sent to the web front-end via a CDN (content delivery network).

The front-end team is responsible for the BFF, SSR (server-side rendering) and the rendering/interactivity that runs in the browser.

They want to set an SLI for their team that:

Aligns with the higher level SLI (news freshness)

Does not punish the team for things that are outside their control (e.g. the CMS content, which is controlled by editorial, or user behavior)
Can give a focus to optimization by exposing the gauges and knobs that the frontend team controls

The front-end team examines how reliability is perceived and what the first step in their optimization efforts should be. They decide that, instead of focusing on total end-to-end freshness, the part that is valid for their SLI is:

BFF cache and above
Articles in the breaking news section for the last 24 hours (excluding other sections like entertainment, sports, opinions, etc.)
New articles (skipping the update freshness for this quarter’s SLI).
Lead time for new articles as the optimization focus. Editorial wants new articles to be on the front page less than 1 minute after the “Publish” button is pressed in the CMS.

Putting it all together,

Valid: articles from the breaking news section which were first published less than 24 hours ago
Good: the number of valid articles where the time difference between the “Published” field on the web and in the API call from the CMS is below 1 minute.
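The definition above could be computed along these lines. This is a sketch under assumptions: the `Article` fields, the `breaking-news` section name and the sample timestamps are all made up, and in practice the two “Published” values would come from the web page and from the CMS API response.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Article:
    section: str
    cms_published: datetime  # "Published" according to the CMS API
    web_published: datetime  # "Published" as seen on the web


MAX_AGE = timedelta(hours=24)
MAX_LEAD_TIME = timedelta(minutes=1)


def is_valid(article: Article, now: datetime) -> bool:
    return (article.section == "breaking-news"
            and now - article.cms_published < MAX_AGE)


def is_good(article: Article) -> bool:
    return article.web_published - article.cms_published < MAX_LEAD_TIME


def freshness_sli(articles: List[Article], now: datetime) -> Optional[float]:
    valid = [a for a in articles if is_valid(a, now)]
    if not valid:
        return None
    return 100 * sum(is_good(a) for a in valid) / len(valid)


def utc(day: int, hour: int, minute: int = 0, second: int = 0) -> datetime:
    return datetime(2023, 8, day, hour, minute, second, tzinfo=timezone.utc)


now = utc(8, 12)
articles = [
    Article("breaking-news", utc(8, 11), utc(8, 11, 0, 40)),  # good
    Article("breaking-news", utc(8, 10), utc(8, 10, 2, 30)),  # too slow
    Article("sports", utc(8, 11, 30), utc(8, 11, 30, 10)),    # other section
    Article("breaking-news", utc(6, 9), utc(6, 9, 0, 20)),    # older than 24h
    Article("breaking-news", utc(8, 8), utc(8, 8, 0, 59)),    # good
]

freshness = freshness_sli(articles, now)
```

Three of the sample articles are valid and two of them reached the web in under a minute. The sports article and the two-day-old article do not influence the result at all.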