

Discover more from Alex Ewerlöf Notes
Reliability Engineering Mindset
Subtitle: Concepts, Patterns, and Tools for building, maintaining, and evolving reliable software products
A few trends dominate the past few years:
As our lifestyle relies more on computers, we need to take the reliability of those systems even more seriously
As AI-assisted programming tools write more software, we need reliable mechanisms to stay in control
Software is increasing in complexity, both in the number of components and in the amount of “black box code” in those components
This book is a collection of my trickiest learnings in simple language. Many of the chapters are already written as standalone blog posts.
The Google SRE books do an excellent job of explaining the WHAT of Site Reliability Engineering. But I feel the WHY and HOW are missing, at least in a form that is approachable to software engineers.
That is why I decided to author this book. It is a collection of essays and breakdowns of complex topics to make reliability engineering approachable to every software engineer.
Instead of repeating the technical details which are one Google or ChatGPT query away, this book tries to build a more timeless mental model and attitude towards reliability.
The idea is to take you from knowing nothing about service levels to using them as a tool to measure and improve reliability.
Change and complexity are the number 1 & 2 enemies of reliability, respectively. But we cannot solve the problem with the same mindset that created it. The solution is definitely not to stop the change or complexity. The solution is to define the risk tolerance and use service levels to systematically assess and control that risk while not hindering the pace of innovation and change.
It requires a combination of tools, models and mindset. That’s what this book is about.
Table of Contents
NOTE: Each section has a title and a description. If the contents are already written, the title links to where you can read it.
🟢= Written (click the link to read)
🟠= Draft is being reviewed, edited, and illustrated
🔴= Is not written yet
🆓= Available for free
🪙= Partially paywalled
Maturity levels 🟠
Not every team needs service levels, and not everyone is receptive to the idea. This chapter defines a maturity model that sets the right expectations and lays out a framework for gradually evolving into full ownership.
How reliability is perceived 🟠
We show some examples and then introduce a simple workshop format that can be used to identify the reliability perception and set SLOs for various team topologies.
Risk assessment 🟠
We need to build a language about how we talk about risk. This chapter builds that jargon to an extent that’s required to understand the book.
The Service Level workshop 🟠
This is the workshop I run professionally for different teams to identify the risks, key metrics, and a reasonable objective.
Service Level Document 🟠
This is a tool to communicate expectations between teams and stakeholders. We talk about a format that is based on what Google suggests and builds on top of it to create a contract.
Crafting a useful model 🟠
Service Levels are simplified models to map system behavior to user behavior. What defines a good model and how to test it?
Availability 🟠
This is one of the most common types of metrics. We discuss 4 types of availability metrics and when to use which.
Latency 🟠
Another popular type of service level indicator.
Error rate 🟠
Or, more properly, success rate: a common type of service level indicator.
Reliability of data systems 🟠
Different metrics to measure the reliability of data and stateful services.
Reliability of AI systems 🟠
Various metrics that are used to measure the reliability of AI models.
Introduction Service Level Indicator (SLI) 🟢🪙
Delve into SLI with examples, decisions to be made, and what is not an SLI.
What is time-based vs event-based SLI? 🟢🆓
One of the first decisions about SLI is whether it is time-based or event-based.
What is “valid” in SLI and why not use “total”? 🟢🆓
The denominator in the SLI formula is often mistaken for "total". It's a huge miss if we don't use it to scope our optimization effort and clarify ownership.
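As a rough illustration of why the denominator matters, the sketch below computes an SLI as good over valid rather than good over total. The `Event` fields and the rule for what counts as valid (here, events that fall within what the team owns) are assumptions made for this example, not definitions taken from the chapter.

```python
from dataclasses import dataclass


@dataclass
class Event:
    in_scope: bool  # hypothetical: does this event belong to what the team owns?
    ok: bool        # hypothetical: did the event meet the definition of "good"?


def sli(events: list[Event]) -> float:
    """Return good / valid as a percentage; only valid events are counted."""
    valid = [e for e in events if e.in_scope]
    if not valid:
        raise ValueError("no valid events in this window")
    good = sum(1 for e in valid if e.ok)
    return 100.0 * good / len(valid)


events = [
    Event(in_scope=True, ok=True),
    Event(in_scope=True, ok=True),
    Event(in_scope=True, ok=False),
    Event(in_scope=False, ok=False),  # e.g. traffic the team does not own
]

sli_valid = sli(events)                                        # 2 good out of 3 valid
naive_total = 100.0 * sum(e.ok for e in events) / len(events)  # 2 good out of 4 total
```

With this made-up data the two readings differ noticeably, which is the point: the out-of-scope failure drags down the "total" figure even though the team cannot act on it.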
What is the definition of “good” in SLI? 🟢🆓
How do we define the good portion of time or events?
Where to measure matters. Why do you get different readings? 🟢🆓
Where we measure is sometimes more important than the actual value we measure.
Service Level Objective (SLO) 🟠
Deep dive into SLOs, the decisions to be made, and clearing up common misconceptions around them.
Rule of 10x per 9 🟢🪙
For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.
Multi-tiered SLOs 🟢🪙
Using different parameters, we can set multiple objectives for the same indicator.
Compliance period (AKA “window”) 🟠
Different types of compliance periods and how to pick one that suits your product.
Alerting on SLOs 🟠
What’s the point of measuring if we don’t commit to keep it in good shape? This chapter examines how we convert our SLO commitments to alerts. We also use the Service Level Calculator.
On-call 🔴
The practicalities (payment, off time, contract, etc.), toil reduction, and MTT* metrics.
Service Level Agreements (SLA) 🟠
Defines SLA and clarifies how it differs from SLO. Examines the SLA of S3, one of the oldest and most reliable services on the internet.
The legal commitment and punishment models 🟠
The loopholes and some examples 🔴
How to set the level for SLA 🟠
What is reasonable to commit to, based on composite system reliability?
The introduction chapter builds a language for architectural resilience. This chapter talks about how to build a system with reliability as a requirement. Topics include load balancers, caches, redundancy, rigidity/fragility, viscosity, graceful degradation, etc.
Composite system reliability 🟢🆓
How to calculate the reliability metrics of a complex system that’s composed of multiple sub-systems?
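A minimal sketch of the usual arithmetic, assuming failures are independent: components that must all work (serial) multiply their availabilities, while redundant components (parallel) fail only when every replica fails. The function names and example numbers are illustrative and not taken from the chapter.

```python
from math import prod


def serial(availabilities):
    """All components must be up: multiply availabilities (fractions in [0, 1])."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one replica suffices: 1 minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


chain = serial([0.999, 0.999, 0.999])  # three dependencies in series
pair = parallel([0.99, 0.99])          # two redundant replicas
```

Under these assumptions, three 99.9% dependencies in series give roughly 99.7%, while two 99% replicas in parallel give about 99.99%. Real systems rarely fail independently, so treat the result as an optimistic estimate.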
Architectural patterns
Deployment patterns 🟠
Blue Green 🟠
Canary 🟠
Dark Launch 🟠
Feature flags 🟠
Tooling and requirements 🔴
Observability: Metrics, Logs, Traces 🔴
Monitors and alerts 🔴
Incident handling and postmortem 🔴
Reliability Engineering doesn’t happen in a vacuum. It needs a special mindset and mental model at both the engineering and leadership levels.
Getting buy-in from leadership 🟠
Every organization is different, but most leaders care about results. How can we frame service levels with a clear outcome to get the time budget to shift the way of working?
What problems can service levels solve, and what problems don’t they solve? Why should a particular organization use them, and why might they not make sense for another? How does operating with service levels change the way of working (accountability, transparency, and data-driven optimization)? How do we communicate complex technical topics to an audience that is not necessarily technical but needs to assess the tool?
Motivates and defines the true ownership trio and its 3 elements: knowledge, mandate and responsibility.
6 archetypes of broken Ownership 🟢🆓
A follow-up discussing various forms of broken ownership.
Team ownership 🟠
What exactly does the ownership trio look like at a team level?
Individual ownership 🔴
How do knowledge, mandate and responsibility empower an individual in their career?
Organizational ownership 🟠
How can an organization set itself up for true ownership?
We use the washing machine analogy to nail the pros and cons of owning your washing machine at home or relying on the central laundry room.
We also point to relevant concepts from Team Topologies (the book) and offer slightly different perspectives on how to control cognitive load.
Amortization plans for Tech debt 🟠
We also discuss how to take care of tech debt and make stronger architectural decisions without the burden of a technical committee.
QA
Q. Who are you?
A. I have spent over 2 decades designing, building, or maintaining software systems across a wide range of industries that require high system reliability. I also have an MSc in Systems Engineering, which, paired with various roles, helps me see the patterns, tools, and mindset for building reliable systems.
Q. When will the book be done?
A. Some chapters are already available to read. I’ll add more chapters when I have time. You can see the status in the Table of Contents above.
Q. When will the book be available for purchase?
A. You mean as a paper book? I don’t know. I decided that the practical matters shouldn’t block the writing process.
The chapters will come out gradually and be linked here. If you are a subscriber, you get the first draft as soon as it’s out. I’ll update this answer when I know more. But I’m hoping sometime during 2024.
Q. Will this be one of those books where you basically package and sell a bunch of blog posts?
A. I use the free blog/newsletter channel to test ideas and get feedback. The book will contain many of those ideas, but the tone, illustration and cohesion will be different than a bunch of free-standing blog posts.
Q. How much will it cost?
A. I don’t know at the moment. But I’ll update it once I know more.
Q. What does the competition look like for this book?
A. This book addresses the same audience as Google’s free SRE books:
The Unicorn Project by Gene Kim
We are not going to repeat the wisdom of those books, but rather complement them. Although this book picks up where the books above left off, there will be enough background info to make it a free-standing read, so you don’t have to read those other books to be able to read this one.
Q. Any estimate on page/word count?
A. Page/word count is a vanity metric. There will be as few words as necessary to build a mindset. My goal is for this to be a light book because I don’t like thick books. It may be that each part becomes its own book if the total is more than 300 pages.