Table of Contents
(🟢 = Written, 🟠 = Drafted, 🔴 = Not written)
Introduction 🟢
Explains why the world needs one more book and what motivated me to write it.
Part I: Reliability
How reliability is perceived 🟠
We show some examples and then introduce a simple workshop format that can be used to identify how reliability is perceived and to set SLOs for various team topologies.
Degradation vs disruption 🟢
What's the difference between service degradation, service disruption, and service outage and why does it matter?
Risk assessment 🟠
We need to build a language for how we talk about risk. This chapter builds that jargon to the extent required to understand the book.
What is a service? 🟢
Before talking about SLI and SLO, we need to clarify what a service is.
The Service Level workshop 🟠
This is the workshop I run professionally for different teams to identify the risks, key metrics, and a reasonable objective.
Service Level Document 🟠
This is a tool to communicate expectations between teams and stakeholders. We talk about a format that is based on what Google suggests and builds on top of it to create a contract.
Crafting a useful model 🟠
Service Levels are simplified models to map system behaviour to user behaviour. What defines a good model and how to test it?
Availability 🟠
This is one of the most common types of metrics. We discuss 4 types of availability metrics and when to use which.
Latency 🟠
Another popular type of service level indicator.
Error rate 🟠
Or more properly, success rate. This is a common type of service level indicator.
Reliability of data systems 🟠
Different metrics to measure the reliability of data and stateful services.
Reliability of AI systems 🟠
Various metrics that are used to measure the reliability of AI models.
Introduction to Service Level Indicator (SLI) 🟢
Delves into SLI with examples, the decisions to be made, and what is not an SLI.
What is time-based vs event-based SLI? 🟢
One of the first decisions about SLI is whether it is time-based, or event-based.
What is “valid” in SLI and why not use “total”? 🟢
The denominator in the SLI formula is often mistaken for "total". It's a huge miss if we don't use it to scope our optimization effort and clarify ownership.
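As a rough illustration of why the denominator matters, here is a minimal sketch that computes the same indicator against "valid" events and against "total" events. The function name `sli` and all the numbers are hypothetical, chosen only to show how out-of-scope events can drag the reading down.

```python
def sli(good: int, valid: int) -> float:
    """Return the SLI as the percentage of good events among valid events."""
    if valid == 0:
        # Assumption for this sketch: no valid events means nothing failed.
        return 100.0
    return 100.0 * good / valid


# Hypothetical traffic: 1000 requests in total, 50 of which are out of
# scope for the team (for example, requests it is not responsible for).
total = 1000
out_of_scope = 50
valid = total - out_of_scope
good = 940

sli_valid = sli(good, valid)  # scoped to what the team owns
sli_total = sli(good, total)  # counts events the team cannot influence

print(f"SLI over valid: {sli_valid:.2f}%")
print(f"SLI over total: {sli_total:.2f}%")
```

With these made-up numbers, the reading over "valid" is noticeably higher than the one over "total", because the out-of-scope events no longer count against the team.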
What is the definition of “good” in SLI? 🟢
How do we define the good portion of time or events?
Where to measure matters. Why do you get different readings? 🟢
Where we measure is sometimes more important than the actual value we measure.
Service Level Objective (SLO) 🟢
Deep dive into SLOs, the decisions to be made, and clearing common misconceptions around them.
Service Level Status (SLS) 🟢
SLS shows the status of the SLO commitment according to the SLI metric towards the service consumer at any given point in time.
Rule of 10x per 9 🟢
For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.
Multi-tiered SLOs 🟢
Using different parameters, we can set multiple objectives for the same indicator.
Compliance period (AKA “window”) 🟢
Different types of compliance periods and how to pick one that suits your product.
Alerting on SLOs 🟠
What’s the point of measuring if we don’t commit to keep it in good shape? This chapter examines how we convert our SLO commitments to alerts. We also use the Service Level Calculator.
On-call 🔴
The practicalities (payment, off time, contract, etc.), toil reduction, and MTT* metrics.
Service Level Agreements (SLA) 🟢
Defines SLA and clarifies how it differs from SLO. Examines the SLA of S3, one of the oldest and most reliable services on the internet.
The legal commitment and punishment models 🟠
SLA tricks 🟠
Various engineering and legal techniques that you can use to your advantage to reduce the consequences of system failure.
How to set the level for SLA 🟠
What is reasonable to commit to, based on composite system reliability.
Part II: Engineering
This part goes through different architectural patterns to improve software resilience.
Composite system reliability 🟢
How to calculate the reliability metrics of a complex system that’s composed of multiple sub-systems?
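A minimal sketch of the usual textbook arithmetic, assuming the sub-systems fail independently of each other (a strong assumption that rarely holds exactly in practice). The function names `serial` and `parallel` and the example numbers are hypothetical.

```python
from math import prod


def serial(availabilities):
    """Availability of a chain where every sub-system must be up."""
    return prod(availabilities)


def parallel(availabilities):
    """Availability of redundant sub-systems where at least one must be up."""
    return 1 - prod(1 - a for a in availabilities)


# Two hard dependencies at 99.9% each: the chain is less reliable than either.
chain = serial([0.999, 0.999])

# Two redundant replicas at 99% each: the pair is more reliable than either.
pair = parallel([0.99, 0.99])

print(f"serial:   {chain:.6f}")
print(f"parallel: {pair:.6f}")
```

Under the independence assumption, serial dependencies multiply availabilities, while redundancy multiplies unavailabilities.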
Percentiles 🟢
Why percentiles are important for understanding and optimizing system behavior, when to use them and how.
Non-functional requirements (NFR) 🟢
What are functional and non-functional requirements and where does reliability fit?
Architectural patterns
Deployment patterns 🟠
Blue Green 🟠
Canary 🟠
Dark Launch 🟠
Feature flags 🟠
Break glass 🟠
Tooling and requirements 🔴
Observability: Metrics, Logs, Traces 🔴
Monitors and alerts 🔴
Incident handling process 🔴
Postmortems and fire drills 🔴
Service Level Calculator 🟢
Introducing the service level calculator as a learning tool and utility in the hands of engineers and business stakeholders to set reasonable SLOs and understand the alerting rules.
Part III: Mindset
Reliability Engineering doesn’t happen in a vacuum. It needs a special mindset and mental model at both the engineering and leadership levels.
Service level adoption obstacles 🟢
Why is one of the greatest ways to measure and improve reliability so heavily underused and what are the most common ways companies fail to implement them correctly?
Service Level adoption steps 🟠
Not every team needs service levels, and not everyone is receptive to the idea. This chapter is about defining a maturity model that sets the right expectations and lays out a framework for gradually evolving into full ownership.
Getting the buy-in from leadership 🟠
Every organization is different, but most leaders care about results. How can we frame service levels with a clear outcome to get the time budget to shift the way of working?
What problems can service levels solve and what problems don’t they solve? Why should a particular organization use them and why don’t they make sense for another? How does operating with Service Levels change the way of working: accountability, transparency, and data-driven optimization? How to communicate complex technical topics to an audience who is not necessarily technical but needs to assess the tool.
Motivates and defines the true ownership trio and its 3 elements: knowledge, mandate and responsibility.
6 archetypes of broken Ownership 🟢
A follow-up discussing various forms of broken ownership.
What exactly does the ownership trio look like at a team level? You should never be responsible for what you don’t control, and you should take control of what you are held responsible for.
Individual ownership 🔴
How do knowledge, mandate and responsibility empower an individual in their career?
Organizational ownership 🟠
How can an organization set itself up for true ownership?
4 archetypes of SRE 🟢
2 decades after the term SRE was coined, there are many flavors of SRE out in the wild and people who carry the title have a diverse range of skills.
Engineering Leadership Ownership 🟢
Engineering leaders have an overlap with technical leaders. This article tries to clarify their responsibilities with examples and prevent one of the most common anti-patterns for tech leads.
We use the washing machine analogy to nail the pros and cons of owning your washing machine at home or relying on the central laundry room.
We also point to relevant concepts from Team Topologies (the book) and offer slightly different perspectives on how to control cognitive load.
Amortization plans for Tech debt 🟠
We also discuss how to take care of tech debt and make stronger architectural decisions without the burden of a technical committee.
Writing a book is a tremendous task. From August 2023 to July 2024, I spent more than 500 hours drafting, editing, researching, and illustrating the book, and it is not yet finished.
My monetization strategy is to give away most content for free. I pull these hours from my private time, vacation days and weekends. Since April 2024, I have reduced my working hours and salary by 10% to be able to spend more time learning and sharing my experience with the public.
You can support this cause by sparing a few bucks for a paid subscription. As a token of appreciation, you get access to the Pro-Tips sections on all posts.
Right now, you can get 20% off via this link. You can also invite your friends to gain free access. If none of that works for you, please share this book in your circles to help others discover it. Thanks in advance. 🙏