Service Level Agreement
Introduction to the SLA in relation to SLI and SLO
SLA stands for Service Level Agreement. The agreement here refers to a legally binding contract, not just a handshake between teams (that’s where an internal SLO usually stops).
SLA is a legal layer on top of the SLO. Think of it as a more serious objective on top of a high-level indicator. It builds on top of what we have covered so far:
SLI defines how reliability is measured from the service consumers’ point of view. In the case of SLA, the consumer is usually an external entity (for example customers or end-users).
SLO sets reasonable expectations for SLI. In the case of SLA, the objective that’s legally promised externally is lower than what the service provider aims for internally.
SLA takes the SLO expectations to the next level and turns them into legal contracts with accountability and penalties.
The purpose of the SLA is to build trust with key service consumers and stakeholders. It shows that the service provider is serious enough to compensate for a service disruption. 🤑 SLA provides consumers with leverage to hold the service provider accountable for service degradation. 🔨
Not every consumer has an SLA though.
Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the entire world. […] Many other Google services, such as Google for Work, do have explicit SLAs with their users. — Google SRE Book, Chapter 4
SLA only makes sense when 3 conditions are met:
Measurability: the service level is quantifiable (SLI), so both parties can build a model of how the service is going to be consumed.
Leverage: the service consumer has leverage to hold the service provider legally accountable for breaching the contract.
Incentives: the service provider has a motivation to guarantee the service level.
Example
Let’s imagine you are a company that sells customer service chat bots. You’ve built your chat bots on top of an external cloud infrastructure provider.
Obviously, you want the infrastructure to work all the time because if the infra goes down, your chat bot is not available and that’s bad for your business. But as we argued before, a 100% service level is unrealistic. With an SLA, you get some sort of guarantee: if the infrastructure goes down, the provider must compensate you.
SLA allows sharing risk between the service consumer and service provider.
However, beware that most service providers cannot act as an insurance provider — otherwise their fees would be too high to be competitive!
What the service providers are willing to put on the table in terms of penalties is often much less than the money you lose when your service goes down.
Your business must earn more than it pays the external cloud provider to be able to afford its services (except when it’s a startup, but even then you must think about the risks). When the provider goes down, it may give back part of what you pay, but it definitely won’t compensate your entire loss.
If they had to compensate for every business loss for every type of business model that runs on their cloud, they’d go out of business very soon!
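The gap between penalty and loss is easy to see with a little arithmetic. The sketch below is only an illustration: the credit tiers, the monthly fee, the hourly revenue, and the outage length are all hypothetical numbers, and real SLAs define their own tiers and measurement rules.

```python
# Hypothetical illustration of an SLA service credit versus the consumer's loss.
# None of these numbers come from a real provider's SLA.

HOURS_PER_MONTH = 30 * 24  # assume a 30-day billing month (720 hours)

# Assumed credit tiers: (availability threshold, fraction of the fee refunded).
# If measured availability falls below a threshold, that fraction applies.
CREDIT_TIERS = [(0.995, 0.10), (0.99, 0.25), (0.95, 1.00)]


def availability(downtime_hours, period_hours=HOURS_PER_MONTH):
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


def service_credit(monthly_fee, measured_availability, tiers=CREDIT_TIERS):
    """Return the refund owed: the largest credit fraction among breached tiers."""
    fraction = 0.0
    for threshold, credit_fraction in tiers:
        if measured_availability < threshold:
            fraction = max(fraction, credit_fraction)
    return monthly_fee * fraction


monthly_fee = 10_000.0      # what the chat bot company pays the provider
revenue_per_hour = 5_000.0  # what the chat bot company earns per hour
downtime_hours = 8.0        # one bad outage this month

measured = availability(downtime_hours)         # about 98.89%
credit = service_credit(monthly_fee, measured)  # 25% tier applies
loss = revenue_per_hour * downtime_hours

print(f"availability: {measured:.4%}")
print(f"service credit: {credit:.2f}")
print(f"business loss: {loss:.2f}")
```

With these made-up figures the provider refunds a quarter of the monthly fee while the consumer loses many times that amount, which is the asymmetry described above.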
Structure of the SLA document
The purpose of the SLA document is to build a shared understanding with external parties about:
DoS (definition of service): What is the service
SLI and SLOs: What are the commitments of the service provider (usually formatted in a table)
Penalties: What are the penalties and incentives
Exclusions: What is the intended consumption and what is excluded from the agreement
Enforcement mechanism: What are the support channels and requirements for the service consumer to apply those penalties and incentives
Glossary: A glossary of terms is usually added for clarity including the fine print
Most SLAs follow that exact structure. A few examples:
Sometimes a company can make a special deal with a customer on top of an MSA (master subscription agreement). Examples here: Datadog, Zendesk, Salesforce.
Traditional service providers usually offer Terms & Conditions or Terms of Service instead of an SLA. As we discussed before, one of the key ideas behind service levels is to shift the conversation from “the systems should never fail” to “what is failure and how much of it can you tolerate”. Besides, quantified service levels allow informed discussions about the cost of reliability, among other things.
SLA vs SLO
SLA builds on top of SLO. More accurately, SLA has an SLO.
Comparing SLA to SLO is not exactly comparing apples to oranges, but rather comparing apples to farmers!
SLA and SLO are related but different:
Audience: They both set expectations between a service provider and a service consumer. For SLA, the consumer is often an external entity, whereas for a plain SLO, the consumer is often internal (other teams in the same organization, e.g., engineering teams, product managers, etc.). The SLA, however, has an SLO too.
Commitment: They both communicate measurable service level commitments towards the consumers. In the case of SLA, however, the commitment is tied to financial penalties and entitles the consumer to compensation or legal action in case of a breach.
Expectation: They both set expectations that are arrived at through research and negotiation. But SLA promises less reliability than SLO. For example, if internally we aim for an SLO of 99.99%, the SLA we commit to externally may be 99.5%. SLA is set between legal parties. If the service provider is a large vendor, it may get away with having a cookie cutter SLA for everybody. But for small collaborations, the SLA is a product of negotiation where the consumer wants the provider to act as an insurance company while the provider wants the largest possible error budget.
Detail: They both tie an objective to an indicator, but SLA is more detailed. SLA adds a layer on top of SLI and SLO to clarify the definition of service, penalties, exceptions, and support mechanisms among other things. On the other hand, since the SLA is for external consumers, it does not contain details about your internal systems that are supposed to be abstracted away from the external consumers’ point of view.
Service Level Indicator: The SLA measurement is set up in a way that reduces liability for the service provider in case of a breach. While SLO is primarily used for service level optimization, SLA is primarily used for commitment. This has ramifications on the indicator (SLI): what you communicate publicly (SLA) should be easy to understand and measure whereas what you are trying to optimize internally (SLO) may be a nuanced and complex metric. For example, you may commit to Availability of 99.6% externally for an API SLA. But internally the team that owns the API gateway has an Availability SLO of 99.9% and the API teams may be measuring latency, data freshness, or correctness as their SLO.
Monitoring: They are both monitored and tied to alerting. SLA, however, may also be tied to a publicly visible status page to help the consumers troubleshoot their system by quickly assessing the reliability of their dependencies. For example, the Azure Status page helps their customers to rule out cloud disturbances in case they have an incident.
Optimization: They are both used to give direction and focus to service level optimization. SLA, however, is written in a way that protects the service provider’s interest. That is because when the service level is poor, the service provider already looks bad. The last thing they want to do is to add insult to injury and put money on the table. In a future article we’ll dig into all the tricks that are used to write a solid SLA that minimizes penalties in case of a service disruption.
Boundary: They are both measured at the boundary of service providers: for SLO it might be at the boundary that the team controls, for the SLA it is at the boundary of legal entities. SLAs are often broader in scope and cover a range of services or components. They are typically defined at a higher level, encompassing multiple SLOs or service components.
Accountability: For SLO, the team tries to be responsible for variables that they can control. For the SLA, the legal entity (e.g., the company) assumes full control of what it’s accountable for. I’ve explained both approaches here. SLA is supposed to trickle down through the architecture and be translated to SLO for each system that supports it.
Method: SLO is often set in collaboration with the team (🔴in a future post I’ll write about the workshops I do across the organization to set SLOs). Unfortunately, SLAs are sometimes written by the legal team with influence from the sales team and no input from the engineering teams. This leads to commitments that are unrealistic or impossible to measure. It is important to approach the SLA the same as SLO and then add the legal layer on top of it, not the other way around.
SLO (internal commitment) should support SLA (external commitment). What you aim for internally should always be higher than what you’re promising externally. It’s wiser to under-promise and over-deliver than the other way around.
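To make the gap between the internal and external targets concrete, here is a minimal sketch that converts an availability target into an allowed downtime budget. It reuses the 99.99% SLO and 99.5% SLA figures from the Expectation point above; the 30-day window and the function name are assumptions for illustration.

```python
# Convert availability targets into downtime budgets over a measurement window.
# The 30-day window is an assumption; real SLAs define their own period.

def allowed_downtime_minutes(target, period_days=30):
    """Minutes of downtime permitted by an availability target over the period."""
    period_minutes = period_days * 24 * 60
    return (1 - target) * period_minutes


internal_slo = 0.9999  # what the team aims for internally
external_sla = 0.995   # what the company promises in the contract

slo_budget = allowed_downtime_minutes(internal_slo)
sla_budget = allowed_downtime_minutes(external_sla)
safety_margin = sla_budget - slo_budget

print(f"SLO budget: {slo_budget:.2f} minutes per 30 days")
print(f"SLA budget: {sla_budget:.2f} minutes per 30 days")
print(f"Margin before penalties apply: {safety_margin:.2f} minutes")
```

The internal target leaves only a few minutes of downtime per month, while the contractual one tolerates a few hours. The difference is the buffer the provider has between missing its own objective and owing compensation.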
Since the service provider (i.e. the company) will be punished in case of an SLA breach, it has every incentive to use every trick in the book to make it less painful.
🔴In a future article we’ll dig into SLA tips & tricks to protect the service providers.
Such a great article, thanks for sharing! I love how related concepts are collected in the same post, especially with real-life examples from BigTech (and others).
Hej!
Thinking about SLAs, they can feel quite strict to me (disclaimer: I have never defined SLAs in my work, only SLOs). Personally, I find that SLOs hold teams accountable while also allowing some flexibility in reaching those goals, and even in changing the threshold, because they are internal agreements.
Speaking of SLAs, how much experience have you had with adjusting them? Recent changes within the company, like the layoffs, might make some existing SLAs unrealistic. What are your thoughts on how we can approach these situations?
One other thing that piqued my curiosity: Have you ever worked on building cascading SLOs/SLAs? For instance, how would we define our own service level if a third-party service we depend on only offers 90% availability?
Thanks as always for your insights!