Service level adoption obstacles
Why do many organizations fail to adopt and implement service levels properly and what are the practical tips to overcome the pitfalls?
I’ve been working with Service Levels (SLI/SLO) across several businesses now. I’ve run dozens of service level workshops, built a tool with lots of examples, and talked to hundreds of engineers and leaders to get them on board with service levels. I’ve had podcast episodes, talked to other companies, and even drafted a book about it.
It is safe to say that I am invested in service levels 😄
If I could sum up my experience with service levels in one sentence, it would be:
Service levels are the best tool to measure and optimize reliability of complex software systems, but they are heavily underrated.
But why?
First, let’s unpack the first part of my claim. Service levels are the best tool to:
Debunk the assumption that “systems should never fail” and the inevitable panic when they do.
Acknowledge that complex systems fail all the time and shift the conversation to identifying “what is failure” (SLI) and how much of it the consumers can tolerate (error budget).
Normalize failures by using percentages to put them in a larger perspective.
Shift the effort from blindly improving reliability to optimizing reliability to reach an adequate level, not more.
Map responsibility to control and use it for organization design to reduce handovers and the risk of things falling between the cracks.
Enable aligned autonomy (what goes on inside a team is no one’s business as long as the measured service meets the expectations).
Reduce alert fatigue by alerting on error budget burn rate.
Establish a culture of measurement, data-informed decisions, and talking to consumers about their perception of the service level.
Treat reliability as a feature to be weighed against other features considering its ROI (rule of 10x/9).
Map reliability to accountability (topic of an upcoming article).
Promote full ownership (and prevent broken ownership).
Run blameless postmortems to learn from incidents instead of finger-pointing.
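Two of the items above, the error budget and alerting on its burn rate, come down to simple arithmetic. Here is a minimal sketch in Python, assuming an event-based SLI (failed requests over total requests). The function names, the sample numbers, and the default threshold are illustrative assumptions, not taken from any particular tool.

```python
def error_budget(slo_percent: float) -> float:
    """Fraction of events that may fail, e.g. an SLO of 99.9 gives about 0.001."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return 1.0 - slo_percent / 100.0


def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """Observed failure ratio divided by the allowed failure ratio.

    1.0 means the budget is consumed exactly at the pace the SLO allows;
    higher values mean it will run out before the SLO window ends.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_percent)


def should_alert(rate: float, threshold: float = 14.4) -> bool:
    """Page only when the budget burns faster than the chosen threshold.

    The default is illustrative: sustained for one hour, a 14.4x burn
    consumes 2% of a 30-day budget (14.4 / 720 hours = 0.02).
    """
    return rate >= threshold


if __name__ == "__main__":
    # Hypothetical one-hour window: 10,000 requests, 50 failures, SLO 99.9%.
    rate = burn_rate(failed=50, total=10_000, slo_percent=99.9)
    print(f"burn rate: {rate:.1f}x, alert: {should_alert(rate)}")
```

The point of the sketch is that the alert fires on how quickly the tolerated failure is being used up, not on any single metric crossing a threshold, which is what keeps the number of pages down.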
But if service levels are so great, why are they so heavily underused? I have been reflecting on that question over the past few months, and here are the top reasons in my experience:
Mental shift: the basic principles of service levels are not counter-intuitive or new. Yet their proper adoption and implementation require a mental shift. For too long, the conventional wisdom was to pick some metrics (often chosen by engineers without talking to consumers), attach some alerts to metric thresholds, and panic at every incident. This vicious cycle was reinforced by a plethora of tools that were built around this workflow (if you can call it that!). In my experience, the so-called “juniors” who have no experience with monitoring, on-call, or incident handling are much more receptive to giving service levels a try. On the other hand, the “seniors” are often reluctant or flat-out resistant to these ideas. What gives? The baggage! In the age of knowledge workers, the ability to unlearn is as important as the ability to learn. Often, we are trapped in our perceptions and fail to gain new perspectives. The mindset alone is important enough for me to dedicate a third of my book Reliability Engineering Mindset to its individual and collective aspects.
Jargon: SLI, SLO, SLS, SLA, error budget, Burn Rate, Valid, Threshold, Time/Event based, Composite, Multi-Tier, … Yeah! And all you wanted was an informed decision to optimize service reliability 😖 I have written many articles to break down those complex topics and even trained an LLM to have an elevated discussion about these concepts. But that doesn’t change the fact that the language of reliability and data has a learning curve, and it takes time to speak this language fluently. It’s a skill to learn, but is it worth the investment? I believe so. As AI writes more code, there’s an increasing need for people who can reason about the complex behavior of the “black box” and measure and optimize reliability, because we are simultaneously becoming more dependent on these complex interconnected systems.
Math: Arguably, the math for service levels is far simpler than what is taught in 10th grade. Still, math can be intimidating to some people. I have made an interactive tool to make the service level math more approachable. Concepts like percentiles and statistics can be counter-intuitive at first, but once you get the hang of them, they make you a better critical thinker who is comfortable with data and metrics. These skills are superpowers in the age of AI-generated content and realistic deep fakes. Math isn’t about strings of Greek letters; it’s the language of the universe. One who is put off by math has bigger problems with the universe. 🌌
Dunning-Kruger effect: I’ve met many professionals who claim to know these topics very well, but once you dig deeper, you realize they are full of misunderstandings, prejudices, and misconceptions. What surprises me are some people with SRE or DevOps in their job title. Some of them don’t know the difference between SLO and SLA (which is one of my favorite interview questions when hiring SREs)! There are so many nuances and pitfalls when implementing service levels that I could literally fill a book with them, and that’s exactly what I’m doing. What bugs me the most is resistance and arrogance. I have spent years on this topic across different environments and every now and then I still have a new “Aha” moment. It baffles me how some “professionals” think they have figured it out after reading a few pages on the internet, watching some videos, or giving it a half-a** try (which we’ll get to).
Premature implementation: When Google released their excellent books (as part of their strategy to build credibility and bring developers to their cloud platforms), many people got excited. What happened in reality was that they read parts of the book (Chapter 2 is a favorite) and quickly started implementing. The tooling/engineering aspect is always more attractive than the overarching cultural and mental shifts. Many teams made dashboards and set alerts. Those who didn’t have tooling started building or buying it. But the essence of reliability engineering as a cultural shift and mental model got lost somewhere along the way while we were engineering new, shinier tools. Fast forward a few years and those dashboards are gathering dust, and the only people who care about them are the “SREs” who implemented them.
Poor tooling: let’s face it, 21 years after the word SRE was coined, and 8 years after Google went big externally on the topic, the tooling built around service levels is crap at best! And I don’t mean to intimidate people in the slightest. I have tried Datadog, Grafana, Elastic, and a few others. NO ONE GOT IT RIGHT! It is as if they just wanted to check a box to add SLO as a feature to their products. SLO is not just a feature. It is THE way you add value to empower reliability engineering, reshape the organization, and build a culture of data-driven optimization. The state of tooling is so bad that companies like Nobl9 found a niche to build a business (if you’re an observability platform provider, your best bet is to buy them because they have some of the best industry thought leaders). And the native tooling you get out of the box from your favorite cloud provider is mediocre at best. The only exception is Google Cloud, for understandable reasons! I believe these concepts are too powerful to be exclusive to one provider (for comparison, this is what Azure and AWS have in the market). As an industry, we could do better. The precondition is to give SLOs another chance, not as an abstract complex topic but as a new mindset that travels beyond tools, job titles, individual beliefs, and corporate-fabricated cultural slogans. Until then, I keep pushing with whatever tool I have at my disposal.
Negative experience: I've seen service levels being weaponized by leadership, and vanity metrics being measured that lead to no action or, worse, wrong actions. Just because someone abused a tool doesn't mean the tool is bad. I’ve also heard stories of companies hiring leaders from big tech (e.g., Google) who demand SLOs but end up only doing the mechanical part (tooling, measuring something, dashboards on TV, etc.) without shifting the deeper belief systems.
Unrealistic expectations: Alas, many organizations and engineers have a negative experience with transparency (and measurement in general). For SLOs to deliver their full potential, we must create a culture that is comfortable with measurement, verified impact, transparency, accountability, and full ownership. That is a much harder problem to solve. Computers are logical, fast, and accurate creatures. People are slow, error-prone, and have individual interests and agendas. Throwing SLOs at an organization and hoping for them to improve reliability is like throwing a gold-plated wrench into a complex machine and hoping that the wrench will find the right bolts and fix it (tip: the machine’s problem may not even be a bolt but oil!).
The SRE stigma: The concept of service levels gained popularity with Google’s SRE book. Many organizations don’t do SRE because “Platform Engineering deprecated SRE!”, or “SRE works at Google”. What is SRE really? I have identified 4 archetypes, but it boils down to applying software engineering principles to operations! There’s NOTHING in service levels as a concept that requires someone with “SRE” in their job title. They’re useful as-is to any software engineer. Measuring the right metric and tying responsibility to mandate doesn’t need a dedicated title. Neither does normalizing failures and seeing them as learning opportunities. Service levels are completely decoupled from SRE. And I say that as someone who makes a living as a Senior Staff Site Reliability Engineer. In this newsletter, I’m determined to break that stigma and make service levels approachable for all software engineers.
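On the Math obstacle above, here is a small sketch of why percentiles can feel counter-intuitive at first. The latency numbers are made up, and `percentile` is a hypothetical helper using the nearest-rank method, which is only one of several common definitions.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need at least one value and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up sample: 95 fast requests (100 ms) and 5 slow ones (2000 ms).
latencies_ms = [100] * 95 + [2000] * 5

average = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

if __name__ == "__main__":
    print(f"mean={average} ms, p50={p50} ms, p99={p99} ms")
```

In this sample, the mean describes no request that actually happened: the typical consumer sees 100 ms, while the unlucky few wait 2000 ms. That is the kind of difference a well-chosen SLI is meant to surface.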
There’s more, but hopefully the point is clear:
If you have a service, you have a service level, whether you acknowledge, measure, and communicate it or not.
Pro tips
Let’s go through each of the obstacles above and talk about actionable recommendations to overcome them.
So how can you adopt service levels and help your organization with all the benefits that were mentioned?