Service level adoption obstacles
Why do many organizations fail to adopt and implement service levels properly and what are the practical tips to overcome the pitfalls?
I’ve been working with Service Levels (SLI/SLO) across several businesses now. I’ve run tens of service level workshops, built a tool with lots of examples, and talked to hundreds of engineers and leaders to get them onboard with service levels. I’ve had podcast episodes, talked to other companies, and even drafted a book about it.
It is safe to say that I am invested in service levels 😄 If I could sum up my experience with service levels in one sentence, it would be:
Service levels are the best tool to measure and optimize reliability of complex software systems, but they are heavily underrated.
But why?
First let’s unpack the first part of my claim. Service levels are the best tool to:
Debunk the assumption that “systems should never fail” and the inevitable panic when they do.
Acknowledge that complex systems fail all the time, and shift the conversation to identifying what failure is (SLI) and how much of it consumers can tolerate (error budget).
Normalize failures by using percentages to put them in a larger perspective.
Shift the effort from blindly improving reliability to optimizing it to reach an adequate level, not more.
Map responsibility to control, and use that mapping in organization design to reduce handovers and the risk of things falling between the cracks.
Enable aligned autonomy (what goes on inside a team is no one’s business as long as the measured service meets expectations).
Reduce alert fatigue by alerting on error budget burn rate.
Establish a culture of measurement, data-informed decisions, and talking to consumers about their perception of the service level.
Treat reliability as a feature to be weighed against other features based on its ROI (rule of 10x/9).
Map reliability to accountability (topic of an upcoming article).
Promote full ownership (and prevent broken ownership).
Run blameless postmortems to learn from incidents instead of finger-pointing.
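To make the burn-rate item concrete: burn rate is the ratio of the observed error rate to the rate that would consume the entire error budget exactly over the SLO window. A minimal sketch, with function and variable names of my own invention (not from any particular tool):

```python
def burn_rate(bad_events: int, valid_events: int, slo: float) -> float:
    """Burn rate: observed error ratio divided by the allowed error ratio.

    1.0 means the error budget runs out exactly at the end of the SLO
    window; 2.0 means twice as fast; below 1.0 the budget survives.
    """
    allowed = 1.0 - slo                     # e.g. 0.001 for a 99.9% SLO
    observed = bad_events / valid_events
    return observed / allowed

# 30 bad out of 10,000 valid requests in the lookback window, 99.9% SLO
rate = burn_rate(30, 10_000, 0.999)         # 0.003 / 0.001 = 3.0

# One common policy (from Google's SRE Workbook) pages when the 1-hour
# burn rate exceeds 14.4, i.e. 2% of a 30-day budget burned in one hour.
page = rate >= 14.4                         # not page-worthy here
```

Alerting on burn rate rather than raw error counts is what keeps the pager quiet for failures the budget can absorb.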
But, if service levels are so great, why are they so heavily underused? I have been thinking and reflecting on that topic in the past few months and here are the top reasons in my experience:
Mental shift: the basic principles of service levels are neither counter-intuitive nor new. Yet adopting and implementing them properly requires a mental shift. For too long the conventional wisdom was to pick some metrics (often by engineers, without talking to consumers), attach alerts to metric thresholds, and panic at every incident. This vicious cycle was reinforced by a plethora of tools built around this workflow (if you can call it that!). In my experience, the so-called “juniors” who have no experience with monitoring, on-call, or incident handling are much more receptive to giving service levels a try. The “seniors”, on the other hand, are often reluctant or flat-out resistant to these ideas. What gives? The baggage! In the age of knowledge workers, the ability to unlearn is as important as the ability to learn. Too often we are trapped in our perceptions and fail to gain new perspectives. The mindset alone is important enough for me to devote a third of my book, Reliability Engineering Mindset, to its individual and collective aspects.
Jargon: SLI, SLO, SLS, SLA, error budget, burn rate, valid, threshold, time/event based, composite, multi-tier, … Yeah! And all you wanted was an informed decision to optimize service reliability 😖 I have written many articles to break down these complex topics and even trained an LLM to hold an elevated discussion about them. But that doesn’t take away the fact that the language of reliability and data has a learning curve, and it takes time to speak it fluently. It’s a skill to learn, but is it worth the ROI? I believe so. As AI writes more code, there’s an increasing need for people who can reason about the complex behavior of the “black box” and measure and optimize reliability, because we are simultaneously becoming more dependent on these complex, interconnected systems.
Math: Arguably, the math behind service levels is simpler than what is taught in 10th grade. Still, math can be intimidating to some people. I have made an interactive tool to make service level math more approachable. Concepts like percentiles and statistics can be counter-intuitive at first, but once you get the hang of them, they make you a better critical thinker and more comfortable with data and metrics. These skills are superpowers in the age of AI-generated content and realistic deepfakes. Math isn’t about strings of Greek letters; it’s the language of the universe. One who is put off by math has bigger problems with the universe. 🌌
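To see why percentiles are worth the initial confusion: averages hide tail latency, which is exactly what percentile-based SLIs expose. A small illustration with invented numbers:

```python
import statistics

# Invented latency samples (ms): 95 fast requests and 5 slow outliers
latencies = [20] * 95 + [2000] * 5

mean = statistics.mean(latencies)                   # 119 ms: looks healthy
p50 = statistics.quantiles(latencies, n=100)[49]    # median: 20 ms
p99 = statistics.quantiles(latencies, n=100)[98]    # 2000 ms: the tail users feel
```

The mean suggests a mildly slow service; the p99 reveals that 1 in 20 consumers is having a terrible time.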
Dunning–Kruger effect: I’ve met many professionals who claim to know these topics very well, but once you dig deeper you realize they are full of misunderstandings, prejudice, and misconceptions. What surprises me is some people with SRE or DevOps in their job title. Some of them don’t know the difference between an SLO and an SLA (which is one of my favorite interview questions when hiring SREs)! There are so many nuances and pitfalls when implementing service levels that I could literally fill a book with them (and that’s exactly what I’m doing). What bugs me the most is resistance and arrogance. I have spent years on this topic across different environments and still have a new “aha” moment every now and then. It baffles me how some “professionals” think they have it all figured out after reading a few pages on the internet, watching some videos, or giving it a half-a** try (which we’ll get to).
Premature implementation: When Google released its excellent books (as part of its strategy to build credibility and bring developers to its cloud platform), many people got excited. What happened in reality was that they read parts of the books (Chapter 2 is a favorite) and quickly started implementing. The tooling/engineering aspect is always more attractive than the overarching cultural and mental shifts. Many teams built dashboards and set alerts. Those who didn’t have tooling started building or buying it. But the essence of reliability engineering as a cultural shift and mental model got lost somewhere along the way while we were engineering new, shinier tools. Fast forward a few years, and those dashboards are gathering dust; the only people who care about them are the “SREs” who implemented them.
Poor tooling: let’s face it, 21 years after the term SRE was coined, and 8 years after Google went big on the topic externally, the tooling built around service levels is crap at best! And I don’t mean to intimidate anyone in the slightest. I have tried Datadog, Grafana, Elastic, and a few others. NO ONE GOT IT RIGHT! It is as if they just wanted to check a box by adding SLO as a feature to their products. SLO is not just a feature. It is THE way you add value to empower reliability engineering, reshape the organization, and build a culture of data-driven optimization. The state of tooling is so bad that companies like Nobl9 found a niche to build a business (if you’re an observability platform provider, your best bet is to buy them, because they have some of the best industry thought leaders). And the native tooling you get out of the box from your favorite cloud provider is mediocre at best. The only exception is Google Cloud, for understandable reasons! I believe these concepts are too powerful to be exclusive to one provider (for comparison, this is what Azure and AWS have in the market). As an industry, we can do better. The precondition is to give SLOs another chance, not as an abstract, complex topic but as a new mindset that travels beyond tools, job titles, individual beliefs, and corporate-fabricated culture slogans. Until then, I keep pushing with whatever tool I have at my disposal.
Negative experience: I've seen service levels weaponized by leadership, and vanity metrics measured that led to no action, or worse, the wrong actions. But just because someone abused a tool doesn't mean the tool is bad. I’ve also heard stories of companies hiring leaders from big tech (e.g., Google) who demand SLOs but end up doing only the mechanical part (tooling, measuring something, dashboards on TVs, etc.) without shifting the deeper belief systems.
Unrealistic expectations: Alas, many organizations and engineers have a negative experience with transparency (and measurement in general). For SLOs to deliver their full potential, we must create a culture that is comfortable with measurement, verified impact, transparency, accountability, and full ownership. That is a much harder problem to solve. Computers are logical, fast, and accurate creatures. People are slow, error-prone, and have individual interests and agendas. Throwing SLOs at an organization and hoping they will improve reliability is like throwing a gold-plated wrench into a complex machine and hoping that the wrench will find the right bolts and fix it (tip: the machine’s problem may not even be a bolt, but oil!).
The SRE stigma: The concept of service levels gained popularity with Google’s SRE book. Many organizations don’t do SRE because “Platform Engineering deprecated SRE!” or “SRE only works at Google”. What is SRE, really? I have identified 4 archetypes, but it boils down to applying software engineering principles to operations! There’s NOTHING in service levels as a concept that requires someone with “SRE” in their job title. They’re useful as-is to any software engineer. Measuring the right metric and tying responsibility to mandate doesn’t need a dedicated title. Neither does normalizing failures and seeing them as learning opportunities. Service levels are completely decoupled from SRE. And I say that as someone who makes a living as a Senior Staff Site Reliability Engineer. In this newsletter, I’m determined to break that stigma and make service levels approachable for all software engineers.
There’s more, but hopefully the point is clear:
If you have a service, you have a service level, whether you acknowledge, measure, and communicate it or not.
Pro tips
Let’s go through each of the obstacles above and talk about actionable recommendations to overcome them.
My monetization strategy is to give away most content for free. However, these posts take anywhere from a few hours to days to draft, edit, research, illustrate, and publish. I pull those hours from my private time, vacation days, and weekends. Recently I reduced my working hours (and salary) by 10% to be able to spend more time sharing my experience with the public. You can support this cause by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips section as well as my online book Reliability Engineering Mindset. Right now, you can get 20% off via this link. You can also invite your friends to gain free access.
So how can you adopt service levels and help your organization with all the benefits that were mentioned?
I’ve got bad news and good news.
The bad news is that it takes energy, time, and resources to overcome those obstacles. The good news is that it is possible, and once you do, you’re going to reap the benefits with a high ROI (return on investment).
Mental shift: this is by far the hardest obstacle to overcome, mainly because it requires unlearning and keeping an open mind to gain a new perspective. This is the main reason I decided to write a book about it (Reliability Engineering Mindset). If there’s enough demand, I can also try to publish recordings of my slides on YouTube. I mainly created them to help implement service levels across a relatively large organization (1,700 people). Let me know in the comments if that would help.
Jargon: once you get the hang of it, the jargon is rudimentary and interconnected. Some concepts (like setting alerts on SLOs) are harder to grasp, but you can skip them, as they’re not a requirement for adopting service levels. You can start small. I’ve written extensively to decrypt the jargon with lots of examples and illustrations:
SLI: the metric that measures reliability from consumer’s perspective
SLO: the minimum reliability expectation from the service provider
SLS: the status of reliability at a given time
SLA: the legal consequences for unreliability
Error Budget: the amount of unreliability the consumers can tolerate
Burn Rate: the pace of unreliability
Valid: the scope of reliability
Good: the reliable consumption
Time/Event based: how reliability is perceived over time
Composite: calculating the reliability of a complex architecture
Multi-Tier: setting multiple expectations on the same reliability dimension
and I’ll be writing more about these topics as I gain more experience teaching them to my colleagues and the wider community.
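A few of the terms above fit together in just a couple of lines of arithmetic: the SLS is the fraction of valid events that were good, and the error budget tracks how much unreliability the SLO still allows. A minimal sketch (the function names are mine, not from any standard API):

```python
def service_level_status(good: int, valid: int) -> float:
    """SLS: measured reliability, the fraction of valid events that were good."""
    return good / valid

def error_budget_remaining(good: int, valid: int, slo: float) -> float:
    """Fraction of the error budget left: 1.0 = untouched, 0.0 = exhausted."""
    allowed_bad = (1.0 - slo) * valid   # the budget, expressed in events
    actual_bad = valid - good
    return 1.0 - actual_bad / allowed_bad

# 99.5% SLO over 1,000,000 valid events, of which 3,000 were bad
sls = service_level_status(997_000, 1_000_000)              # 0.997: SLO met
budget = error_budget_remaining(997_000, 1_000_000, 0.995)  # 0.4: 60% spent
```

Note how the framing changes the conversation: the service is not “failing”, it has spent 60% of the unreliability its consumers agreed to tolerate.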
Math: I have made Service Level Calculator specifically for making it easy to skip the calculations. It has tons of help snippets hiding behind “learn more” links which in turn point to relevant pages on the internet. It also gives tips and hints based on your setup to prevent easy mistakes (e.g. setting an unrealistic SLO or sluggish alert).
Dunning–Kruger effect: Thinking, Fast and Slow is a classic, and its utility goes far beyond adopting SLOs. But you don’t have to read a book to understand that confidence and competence are decoupled from each other. Just look around. Nothing has damaged humanity throughout history more than confident words coming out of incompetent mouths. “The more I know, the more I realize I know nothing.” ― Socrates
Premature implementation: It’s sad to see service levels underutilized because of rushed implementations. There’s no shortcut to proper usage, but I have written extensively about this topic here.
Poor tooling: unfortunately, this is an area where I can’t have much impact. I have made SLC, but that’s as far as I can go without leading one of the observability tool providers. Almost all observability platforms that support SLOs have done a poor job. When I get the chance, I talk to the product managers and engineers behind those products. But my voice doesn’t penetrate far when PMs believe in quotes like “If I had asked people what they wanted, they would have said faster horses.” Regardless, I think Alex Hidalgo and his company Nobl9 have built one of the best tools out there. I don’t do paid advertisements or sponsorships; Alex is a friend of mine, and I have firsthand experience with their tool. The catch with Nobl9 is that it’s yet another vendor to deal with and yet another destination to send your data to. YMMV, but for me, that’s too much for too little. Ideally, I want to see a proper SLO implementation in any of the popular observability platforms in use at my company. And let’s not forget that the mindset/cultural element is more important than tooling. You can implement SLOs properly even with subpar tooling.
Negative experience: it is sad when a powerful tool like SLO gets weaponized. The more powerful the tool, the more damage it can cause. I’ll be doing my best to shine a light on service levels through my content, and with your help it can reach more eyeballs. An important aspect of implementing SLOs is mapping the accountability that exists in the org structure to the reliability of the services in the architecture diagram. I’ll write more about this aspect later, but TL;DR: engineering leaders should be held accountable for service levels too. Only when leadership is on board and understands the language of reliability can you realistically use ideas like the rule of 10x/9 or a lagom SLO.
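For context, the rule of 10x/9 referenced above reflects that each additional nine of availability cuts the allowed downtime by roughly 10x, while the cost of achieving it typically grows by about the same factor. The allowed-downtime arithmetic, as a sketch:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage a time-based SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
# Each added nine shrinks the budget roughly 10x: 432 -> 43.2 -> 4.3 minutes
```

This is exactly the arithmetic leadership needs to internalize before demanding “another nine”.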
Unrealistic expectations: Service levels are just a tool to measure and improve reliability. They don’t operate in a vacuum: if you commit to a service level but your dependencies don’t, you won’t go far. Sometimes I see companies decide to cargo-cult whatever comes out of big tech, hoping to get big-tech results. Big tech has a fundamentally different competitive landscape, talent pool, and profit/incentive model. Besides, what big tech is willing to communicate publicly may not actually be how it works; at best, it is someone’s interpretation of how things worked at a certain time in one part of the organization. My advice is to always customize and localize these concepts to your culture, product, talent, and business model. That’s what I strive to do at my company. So far, I have done this at 3 companies, and every time it was vastly different from the others.
The SRE stigma: Admittedly, I learned about service levels from Google’s SRE books. Those books do a great job of defining SRE as a profession at a company like Google (or a company that wants to mimic Google’s narrative). But there’s nothing in service levels that ties them to SRE. In fact, I believe service levels have more impact when they’re used by “regular” software engineers. This is not surprising, because according to Ben Treynor Sloss, who coined the term, SRE is what you get when you put software engineers in charge of operations. We’re looking at the problem backwards: you don’t have to be (or have) an SRE to use service levels; rather, you can use service levels to improve system reliability.