

I have spent over 2 decades designing, building, or maintaining software systems across a wide range of industries that require high system reliability. I also have an MSc in Systems Engineering, which, paired with various roles, helps me see the patterns, tools, and mindset for building reliable systems.
This book is a collection of my trickiest learnings in simple language. Many of the chapters are already written as standalone blog posts.
Instead of repeating the technical details, which are one Google or ChatGPT query away, this book tries to build a more timeless mental model and attitude towards reliability.
The idea is to take you from knowing nothing about service levels to using them as a tool to measure and improve reliability.
Metadata
Title
Reliability Engineering Mindset
Subtitle: Concepts, Patterns, and Tools for building, maintaining, and evolving reliable software products
Book description
4 trends have dominated the past few years:
Reliance of our lifestyle on software
AI-assisted programs and AI replacing jobs
Increasing complexity of software, both in the number of components and in the amount of “black box code” in those components
The exponential growth of the above trends
Change and complexity are the number 1 & 2 enemies of reliability, respectively. But we cannot solve the problem with the same mindset that created it. The solution is definitely not to stop the change or complexity. The solution is to define the risk tolerance and use service levels to systematically assess and control that risk while not hindering the pace of innovation and change.
It requires a combination of tools and mindset. That’s what this book is about.
Unique selling points
Attention is rare, and even rarer in the age of AI-generated content. I know you’re busy, so instead of wasting your time with pages upon pages of technical data, this book focuses on building the necessary mindset using the absolute minimum number of words. We rely on funny cartoons, memes, and stories instead of boring, complex diagrams.
The bar is set at this level: you should be able to open the book on your commute and learn something useful in 5-10 minutes of reading.
Keywords
Site reliability engineering
Service levels
Service level indicators
Service level objectives
Alerting on SLOs
Ownership
Resilience patterns for software architecture
Table of Contents
NOTE: Each section has a title and a description. If the contents are already written, the title links to where you can read it.
🟢= Written (click the link to read)
🟠= Draft is being reviewed, edited, and illustrated
🔴= Is not written yet
🆓= Available for free
🪙= Partially paywalled
Part I: Service Levels
Chapter: Why bother with Service Levels? 🟢🆓
Maturity levels 🟠🆓
Not every team needs service levels, and not everyone is receptive to the idea. This chapter defines a maturity model that sets the right expectations and lays out a framework for gradually evolving into full ownership.
How reliability is perceived 🟠🆓
We show some examples and then introduce a simple workshop format that can be used to identify the reliability perception and set SLOs for various team topologies.
Risk assessment 🟠🆓
We need to build a language about how we talk about risk. This chapter builds that jargon to an extent that’s required to understand the book.
The Service Level workshop 🟠🪙
This is the workshop I run professionally for different teams to identify the risks, key metrics, and a reasonable objective.
Service Level Document 🟠🆓
This is a tool to communicate expectations between teams and stakeholders. We talk about a format that is based on what Google suggests and build on top of it to create a contract.
Crafting a useful model 🟠🪙
Service Levels are simplified models to map system behaviour to user behaviour. What defines a good model and how to test it?
Availability 🟠🆓
This is one of the most common types of metrics. We discuss 4 types of availability metrics and when to use which.
Latency 🟠🆓
Another popular type of service level indicator.
Chapter: Service Level Indicator (SLI) 🟠🆓
What is time-based vs event-based SLI? 🟢🆓
One of the first decisions about SLI is whether it is time-based or event-based.
The denominator in the SLI formula is often mistaken for "total". It's a huge miss if we don't use it to scope our optimization effort and clarify ownership.
How do we define the good portion of time or events?
Where we measure is sometimes more important than the actual value we measure.
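To make the time-based vs event-based distinction concrete, here is a minimal sketch. It assumes the SLI is expressed as the percentage of good over valid, where "valid" is the scoped denominator mentioned above. The names `Probe`, `Request`, `time_based_sli`, and `event_based_sli` are hypothetical and only serve the illustration; the linked chapters may define things differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One time slot (for example one minute) and whether the service was healthy in it."""
    healthy: bool


@dataclass(frozen=True)
class Request:
    """One event. `valid` marks events that are in scope for the SLI (assumption)."""
    valid: bool
    good: bool


def time_based_sli(probes):
    """Percentage of time slots in which the service was healthy."""
    if not probes:
        return None  # No data: the SLI is undefined, not 100%.
    good = sum(1 for p in probes if p.healthy)
    return 100.0 * good / len(probes)


def event_based_sli(requests):
    """Percentage of good events among valid events (not among all events)."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # Nothing in scope: the SLI is undefined.
    good = sum(1 for r in valid if r.good)
    return 100.0 * good / len(valid)


# 9 healthy minutes out of 10.
probes = [Probe(True)] * 9 + [Probe(False)]

# 10 in-scope requests (8 good, 2 bad) plus 5 out-of-scope requests.
requests = (
    [Request(valid=True, good=True)] * 8
    + [Request(valid=True, good=False)] * 2
    + [Request(valid=False, good=False)] * 5
)

print(time_based_sli(probes))     # 90.0
print(event_based_sli(requests))  # 80.0
```

Note how the 5 out-of-scope requests do not drag the event-based SLI down: dividing by "valid" instead of "total" is what scopes the metric to what the team actually owns.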
Chapter: Service Level Objective (SLO) 🟠🆓
Compliance period (AKA “window”) 🟠🆓
Different types of compliance periods and how to pick one that suits your product.
Alerting on SLOs 🟠🪙
What’s the point of measuring if we don’t commit to keep it in good shape? This chapter examines how we convert our SLO commitments to alerts. We also use the Service Level Calculator.
On-call 🔴🆓
The practicalities (payment, off time, contract, etc.), toil reduction, and MTT* metrics.
Chapter: Service Level Agreements (SLA) 🟠🆓
Defines SLA and clarifies how it differs from SLO. Examines the SLA of S3, one of the oldest and most reliable services on the internet.
The legal commitment and punishment models 🟠🆓
The loopholes 🔴🪙
The level 🟠🆓
What is reasonable to commit to, based on composite system reliability?
Part II: Reliable Architecture Patterns
The introduction chapter builds a language for architectural resilience: load balancer, cache, redundancy, rigidity/fragility, viscosity, graceful degradation, etc.
Chapter: Calculating System Reliability 🟠🆓
Composite system reliability 🟢🆓
How to calculate the reliability metrics of a complex system that’s composed of multiple sub-systems?
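As a rough illustration of the idea, the sketch below uses the textbook formulas for serial and parallel composition. It assumes component failures are independent, which real systems often violate, and the helper names `serial` and `parallel` are hypothetical. The linked chapter may use a different or more nuanced method.

```python
from math import prod


def serial(*availabilities):
    """All components must work, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Redundant components: the system fails only if every one of them fails."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Two 99.9% services in a chain are less reliable than either one alone.
chain = serial(0.999, 0.999)

# Two 99% replicas behind a load balancer are more reliable than either one alone.
replicas = parallel(0.99, 0.99)

print(f"chain:    {chain:.4%}")
print(f"replicas: {replicas:.4%}")
```

Under these assumptions the chain lands at roughly 99.8% and the redundant pair at roughly 99.99%, which is why adding dependencies lowers what you can reasonably commit to while adding redundancy raises it.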
Chapter: Deployment 🟠🆓
Blue Green 🟠🆓
Canary 🟠🆓
Dark Launch 🟠🪙
Feature flags 🟠🪙
Chapter: Runtime 🟠🆓
Part III: Technical Leadership 🟠
Chapter: Selling it to the leadership 🟠🆓
Every organization is different, but most leaders care about results. How can we frame service levels with a clear outcome to get the time budget to shift the way of working?
The outcome 🔴🪙
What problems can service levels solve and what problems don’t they solve? Why should a particular organization use them, and why might they not make sense for another?
The way of working 🔴🪙
How does operating with Service Levels change the way of working: accountability, transparency, and data-driven optimization.
The tool 🔴🪙
How to communicate complex technical topics to an audience who is not necessarily technical but needs to assess the tool.
Chapter: You build it you own it 🟢🆓
Motivates and defines the true ownership trio and its 3 elements: knowledge, mandate and responsibility.
A follow-up discussing various forms of broken ownership.
Team ownership 🟠🆓
What exactly does the ownership trio look like at a team level?
Individual ownership 🔴🪙
How do knowledge, mandate and responsibility empower an individual in their career?
Organizational ownership 🟠🪙
How can an organization set itself up for true ownership?
Chapter: Platform team 🟠🆓
We use the washing machine analogy to nail the pros and cons of owning your washing machine at home or relying on the central laundry room.
Chapter: Amortization plans for Tech debt 🟠🆓
We also discuss how to take care of tech debt and make stronger architectural decisions without the burden of a technical committee.
Q&A
Q. When will the book be available for purchase?
A. I don’t know. I decided that the practical matters shouldn’t block the writing process. The chapters will come out gradually and be linked here. I’ll update this answer when I know more. But I’m hoping sometime during 2024.
Q. Will this be one of those books where you basically package and sell a bunch of blog posts?
A. I use the free blog/newsletter channel to test ideas and get feedback. The book will contain many of those ideas, but the tone, illustration and cohesion will be different than a bunch of free-standing blog posts.
Q. How much will it cost?
A. I don’t know at the moment. But I’ll update it once I know more.
Q. What does the competition look like for this book?
A. This book addresses the same audience as Google’s free SRE books:
The Unicorn Project by Gene Kim
We are not going to repeat the wisdom of those books, but rather complement them. Although this book picks up where the books above left off, there will be enough background info to make it a free-standing read, so you don’t have to read those other books to be able to read this one.
Q. Any estimate on page/word count?
A. Page/word count is a vanity metric. There will be as few words as necessary to build a mindset. My goal is for this to be a light book because I myself don’t like thick books. It may be that each part becomes its own book if the total is more than 300 pages.
Q. Any timeline for when the first two chapters will be out?
A. Before the end of 2023. There are already many subchapters out and linked in the TOC above. One final editing pass is needed to make everything cohesive.