A few trends have dominated the past few years:
Our lifestyle increasingly relies on software systems. This makes the reliability of those systems even more vital and consequential.
As AI penetrates the SDLC (software development lifecycle), we need reliability mechanisms to stay in control of the ever-increasing complexity.
Vast amounts of data are captured about individuals, devices, and businesses. If “data is gold”, the ability to process that data and turn it into information and wisdom in a timely manner depends on reliable systems.
Google has published 3 fantastic books on the topic, but as a software developer, SRE practitioner, and tech lead responsible for a large organization, I see a gap between where those books end and where most companies need to act.
There are multiple reasons for this gap:
The timing of Google’s books and the fact that they are free (Google’s well-known strategy to build traction) suggest that they are at least partially marketing material for Google’s cloud business. Therefore, the wisdom in those books is generic enough to apply to most potential GCP (Google Cloud Platform) consumers. Although the books are cloud agnostic, many of their ideas require solid tooling, or at least a solid understanding of the topics, before you can take full advantage of them.
Google, and by extension “big-tech”, are what I call software-native companies: companies that are founded by software developers and see software as a first-class product. On the other side of the spectrum, there are glorified IT departments that develop software as a means to an end. If they could achieve that goal without software, they’d happily do so. These are the companies that don’t make money from software, but rather save money with it. Software is a cost center. There’s a big gap between the software-native and glorified-IT ends of the spectrum in terms of funding/budgeting/investment, talent pool, competition model, market, and product. What works at Google may not necessarily work at other companies.
This book is a collection of my trickiest learnings about reliability engineering at 3 companies, written in simple language. We break down complex topics with lots of examples and illustrations to make reliability engineering approachable to every software engineer. The idea is to take you from knowing nothing about service levels to using them as a tool to measure and improve reliability.
This book has 3 parts which go hand in hand:
Reliability: aims to build a language to speak about reliability and set impactful SLIs, reasonable SLOs, and realistic SLAs
Engineering: talks about various techniques to improve architectural and software reliability
Mindset: builds a mentality about how to reach full ownership with practical examples
FAQ
Q. What makes you qualified to write on this topic?
I have spent over 2 decades designing, building, and maintaining software systems across a wide range of industries that have high system reliability requirements.
I also have an MSc in Systems Engineering, which, paired with various roles, helps me see the patterns, tools, and mindset for building reliable systems.
Since 2018, I have been responsible for reliability engineering across 3 companies as senior, staff, and senior staff engineer. As I invested more time in learning these topics, I found myself persuading engineers, managers, and high-level leadership to invest in system resilience, improve operations, and measure the right things. But I think it’s a waste to use these learnings only at work. This book is my effort to share my toolbox with the world and hopefully improve the state of software across industries.
Q. When will the book be done?
Some chapters are already available to read. I’ll add more chapters when I have time. You can see the status in the Table of Contents.
Q. Will there be a paper book?
For now, the book is only available from this blog/newsletter.
I don’t know if there will ever be a paper book because:
I personally don’t read long books and prefer self-contained pages.
A print book requires time investment before it is out. I decided to get the content out gradually as I write it.
I need to redo most of the illustrations for a print book due to the dark color theme of this site.
I like to update the content as I learn more, which is extremely hard with a book that’s already released.
If you are a subscriber, you get the first draft as soon as it’s out. I’ll update this answer when I know more, but I’m hoping for sometime during 2024.
Just looking at the table of contents, it looks like an amazing book!
Can't wait for the book to come out, this is exciting work Alex!