Different hats that SREs wear in the industry: Admin, Architect, Toolsmith, and Firefighter
A lot has changed as the industry adopted SRE.
There are many reasons why something that works for Google may not work at other companies. This analogy captures many of those reasons:
⭐⭐⭐ Google’s version of SRE is like a cookbook for 3-star Michelin meals. Other companies got excited and decided to try it. But as soon as they looked in their [home-grade] kitchen, they saw a mess.
Some companies start building the kitchen (e.g. platform engineering)
Some others start buying the ingredients (e.g. outsourcing)
Others start raising chickens and growing plants for the ingredients (e.g. bespoke solutions often using open-source tools)
All of these are steps toward cooking a nice meal, but not all of them are called “cooking”! You may not even end up with that meal and that’s fine.
I’ve been an SRE in different capacities (Senior, Staff, and Senior Staff) at 3 companies. Over the past 25 years, I have worked with other SREs, DevOps, Architects, Infra, and SysAdmins at a dozen companies. I also write about SRE and mentor a few talented people in this field.
I have observed some patterns of the SRE role across different industries.
A role is a package for a set of expectations. The actual day-to-day tasks at a given company vary based on the industry, size, maturity, trust, and anticipated growth. Moreover, the people carrying the role have diverse backgrounds, skills, and preferences. Add the career ladder, organization topology, and team size to the mix and the definition of the role becomes overly complex.
This article takes a different approach by defining a set of SRE archetypes and concludes with a set of tenets that apply to all those archetypes.
Archetype (noun): the original pattern or model of which all things of the same type are representations or copies (Merriam-Webster Dictionary)
Let’s step away from theory and see how the SRE role is pragmatically implemented at different companies.
4 archetypes of Site Reliability Engineering
In general, SREs can be grouped into 4 archetypes based on their primary concerns: Admin, Firefighter, Toolsmith, and Architect.
Think of these archetypes as “centers of gravity” for the majority of the day-to-day tasks. An individual may put on different hats at different times or sometimes at the same time.
Provision and maintain on-prem or cloud infrastructure (e.g. Kubernetes, CDN, Observability stack, secret management, network topology, etc.)
Manage third-party tools and contracts (e.g. AWS, GitHub, Datadog, Cloudflare, etc.) as well as access control and integration with enterprise SSO (single sign-on) solutions.
Optimize resource usage and cost efficiency across the infrastructure and platforms (also known as FinOps).
Some degree of coding using scripts, IaC (infrastructure as code), and configurations for resource management and automation.
Help set up the on-call process, tooling, and rotation for product teams that need it and put them in charge of the code they develop. You build it, you own it.
Monitor resources (e.g. observing the cluster behavior) to identify anomalies and trigger alerts.
Take SOP (standard operating procedure) mitigation actions (e.g. restarting services, emptying cache, debugging credential/certificate issues, etc.). This is different from a traditional NOC (network operations center) because SREs are software literate and can reason about system behavior and mitigate some risks themselves. Moreover, as we see in the Toolsmith archetype, SREs use automation heavily to eliminate toil (repetitive, manual work that does not require a lot of creativity or problem-solving skills).
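To make the "identify anomalies and trigger alerts" part concrete, here is a minimal sketch that flags a metric sample when it strays too far from its recent history. The function name, the window size, and the threshold are all assumptions for illustration; real observability stacks ship their own, more robust detectors.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples that deviate from the trailing
    `window` samples by more than `threshold` standard deviations.

    Hypothetical helper: the window and threshold are illustrative defaults.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Example: CPU usage (%) sampled once a minute, with one spike.
cpu_usage = [50, 51, 49, 50, 52, 51, 95, 50]
spikes = find_anomalies(cpu_usage)
```

A check like this is only useful if the resulting alert is actionable. Otherwise it feeds the alert fatigue mentioned in the note that follows.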
Note: alert fatigue is demoralizing. Firefighting should only be used to prevent things from falling through the cracks. For example:
When there were no alerts for the incident (novel incidents). SREs may do the triage for such incidents (validate the incident, assess priority, and assign the right team).
When the alert pages the wrong team (alerts are set up on symptoms owned by one product team, while the root cause may be in another team). The SRE has deep architectural knowledge about system dependencies and may help find or delegate to the right team.
When the alert is orphaned and does not page any product team, or the product team’s on-call engineers cannot be reached.
When the user impact is so significant that it requires involving legal, security, or public communication for formal stakeholder communication. This is the MIM (Major Incident Manager) role.
Even in those cases, you need to eliminate the need for firefighting as much as possible. A good metric to measure is the percentage of incidents that were handled without the need to involve the central SRE team.
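As a rough sketch, that metric can be computed from an export of incident records. The `Incident` shape and the `escalated_to_sre` flag below are assumptions; your incident tracker will have its own fields.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape; map it to whatever your tracker exports.
    id: str
    owning_team: str
    escalated_to_sre: bool


def self_handled_percentage(incidents):
    """Percentage of incidents resolved without involving the central SRE team.

    Returns None when there are no incidents, to avoid reporting a
    misleading 0% or 100%.
    """
    if not incidents:
        return None
    self_handled = sum(1 for incident in incidents if not incident.escalated_to_sre)
    return 100.0 * self_handled / len(incidents)


quarter = [
    Incident("INC-1", "checkout", escalated_to_sre=False),
    Incident("INC-2", "search", escalated_to_sre=True),
    Incident("INC-3", "checkout", escalated_to_sre=False),
    Incident("INC-4", "payments", escalated_to_sre=False),
]
print(self_handled_percentage(quarter))  # 75.0
```

Tracking this number over time matters more than any single reading: an upward trend suggests product teams are becoming self-sufficient.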
Reduce cognitive load for the product teams, improve developer experience (DX), and optimize self-service where developers can directly get work done without having to go through people. “Platform Engineering” is the term often used to describe this archetype. The platform in this context is any bespoke solution that’s built on top of external infrastructure providers (e.g. what you get out of the box from Azure or your HP rack) to streamline the needs of the product teams.
Help instrument the observability tooling and build bespoke dashboards and service status pages based on the needs of the organization for visibility, transparency, and reducing MTTD (mean time to detect an incident).
Implement and run automated E2E (end-to-end) tests against production services to detect any significant user-facing issue before someone has to pick up the phone and call customer support.
Implement the “golden path” for reducing friction when provisioning new services, as well as maintaining systems at scale (e.g. patching base images across all services through a central platform).
Identify toil (repetitive, manual work that does not require a lot of creativity or problem-solving skills) and automate it. For example: load testing, QA test frameworks, resource provisioning, as well as repetitive mitigations that firefighters execute.
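To illustrate the automated E2E tests mentioned in this list, here is a minimal sketch of a synthetic probe runner. Everything in it is hypothetical: `run_probe`, `ProbeResult`, and the latency budget are illustrative names, and the "journey" is any callable that exercises a user-facing flow and raises when the flow is broken.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


def run_probe(name: str, journey: Callable[[], None],
              max_latency_s: float = 2.0) -> ProbeResult:
    """Run one user journey and report whether it succeeded in time.

    `journey` is an assumption of this sketch: a function that performs
    the user-facing steps and raises an exception on any failure.
    """
    start = time.monotonic()
    try:
        journey()
    except Exception as exc:
        return ProbeResult(name, False, time.monotonic() - start, str(exc))
    latency = time.monotonic() - start
    if latency > max_latency_s:
        # A slow success is still a user-facing problem.
        return ProbeResult(name, False, latency, "latency budget exceeded")
    return ProbeResult(name, True, latency)


def healthy_login():
    pass  # stand-in for real steps, e.g. calling the login endpoint


def broken_checkout():
    raise RuntimeError("checkout returned 500")


results = [run_probe("login", healthy_login),
           run_probe("checkout", broken_checkout)]
failed = [r.name for r in results if not r.ok]
```

In practice a scheduler would run the probes every few minutes and route failures to the owning team’s on-call, which is what turns this from a test into a detection mechanism.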
Take part in designing new systems based on the NFRs (non-functional requirements) for reliability, scalability, and security. For example, prevent cascading failures, add fallback or failover, implement graceful degradation, and ensure business continuity while maintaining key system functionalities.
Observe system behavior and troubleshoot code to identify performance and reliability bottlenecks and improve code and architecture.
Assess risk and work with product teams to clarify ownership and accountability for their systems. Help set impactful SLIs and reasonable SLOs to establish formal measurable contracts between teams, stakeholders, and dependencies.
Establish standards and help the teams migrate to common solutions to reduce cognitive load, operation costs, and business risks.
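To show how an SLI and an SLO turn into a measurable contract, here is a minimal sketch that computes an availability SLI and the share of error budget left. The function names and the request counts are assumptions for illustration, and it uses a simple request-based SLI, which is only one of several ways to define one.

```python
def availability_sli(good_events, total_events):
    """Fraction of events that met the success criteria (None if no traffic)."""
    if total_events == 0:
        return None
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent for the period.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO
    was breached. `slo_target` is a ratio such as 0.999.
    """
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No failures are allowed (a 100% target) or there was no traffic.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


# Illustrative numbers: 100,000 requests, 50 of them failed, SLO of 99.9%.
sli = availability_sli(99_950, 100_000)
budget_left = error_budget_remaining(99_950, 100_000, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget left: {budget_left:.0%}")
```

With a 99.9% target, 100 failed requests were allowed and 50 were used, so about half of the budget remains. That remaining budget is the number teams and stakeholders can negotiate around.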
These archetypes are not cookie-cutter roles. In reality, a single SRE may fit into multiple archetypes, at different times, depending on the needs and skills.
Notice that DevOps isn’t an archetype. That’s because DevOps is a set of principles, not a role for one person. It involves elements of culture-building, education, tooling, and organizational architecture. In practice, all the archetypes contribute to those concerns to implement DevOps.
Tenets of SRE