Washing machine vs laundry room
The tale of two platform approaches: central platforms vs bespoke platforms
Years ago, I was working as an SRE (site reliability engineer) on a bespoke platform team behind some of Scandinavia’s largest websites.
At the time, our backend workload ran on EC2 containers in AWS using some tooling from Convox. For the most part it ran smoothly, but at one point during 2019 all servers crashed simultaneously and then came back up again.
As far as the end users were concerned, it was just a hiccup of a few minutes and, thanks to our cache layer, probably no one noticed it. But we engineers did.
To make sure that it didn’t happen again, we started digging. And we kept digging to no avail. Empty-handed, we assumed that it was an AWS issue. It was (and still is) proprietary after all.
Kubernetes was hot on the market. Being a self-learner, I put some Raspberry Pi 4’s together to build a cheap Kubernetes cluster to experiment with.
After some learning, I suggested that we do an experiment with Kubernetes at work. At least on paper, it felt like if we changed from EC2 to Kubernetes, we would be able to reason about our cluster better.
Impressed by its DX (developer experience) and our own speed, we went to our engineering manager asking if we could get some time to replace EC2 with Kubernetes.
He was reluctant but didn’t stop us.
Our company also had a central platform and team which ran massive Kubernetes clusters for other parts of the company. They also had lots of other teams who specialized in various central platform-related services.
So, the natural next step for us was to work closely with them to figure out how we can migrate our EC2 workload to their Kubernetes cluster. As far as we knew, this would be a win-win situation: we would outsource the remaining infrastructure work to the central team which specialized in this, and it would free our hands to work more on our application platform.
As we got closer to the finish line, leadership became more opposed to this move. At the time, I didn’t understand the “why?”. It looked like a political move to me, and to some extent it was, but not in a malicious way.
You see, by running on top of the central infrastructure, we would give away part of our freedom, flexibility, and autonomy:
Unlike when having our own setup, we would be treated like yet another customer. Our requests would have to go through some ticketing system and be prioritized against the rest of the requests coming from across the larger corporation.
Sure, we would potentially get better compliance and security “for free”, but our application specific “edge case” requirements had to go through the central infra team which acted as a gate keeper. This would create friction and slow us down.
We would abstract away some of our concerns, but we would also be abstracted away from one of the key elements of ownership. The way they would treat us wouldn’t be that different from how traditional operations teams treated developers. We would have to live through the pains that had already given birth to DevOps (development and operations). For those who don’t know, one of the main tenets of DevOps is to break the wall between those who develop the code and those who operate it. By outsourcing our operational problems to another part of the organization, we would also lose some of the learning.
But the biggest problem was alignment. We had little formal leverage to make the platform team give us what we needed to deliver value. We’ll discuss how platform teams can reduce that risk by having product thinking, but at that company this was our setup in the org chart:
The biggest difference between the bespoke and central platform was this:
Our bespoke solution was part of the same org that built the application for the consumers
The central platform was part of a tall silo with thick walls that extended all the way to the CEO. They had their own roadmap, WoW (way of working), and schedules. There was little we could do to impact them formally even though we were closer to the money (how the business makes profit).
Disappointed, we pulled the plug on that one. But I was determined to get my hands on Kubernetes. As the saying goes:
Either change your company or change your company.
And that’s exactly what I did! 😄
At my next job I was working on the other side of the fence in a central infrastructure team. Of course, we didn’t call it that. We were all SREs. But there was ticketing, babysitting, gate keeping and getting in the way of product teams.
I went on through my career to see a few more stories but today I’m ready to put all of those stories and learnings in a simple and relatable framework:
Washing Machine vs Laundry Room
Many condos have a community laundry room which is either free (funded through the monthly fee) or costs a few coins. It looks something like this:
The idea is simple: if you want to wash clothes, all you have to do is to book a time and show up with your dirty clothes at the community laundry room. There are durable, rugged machines that can wash clothes with the best quality. If any of those washing machines break, it’s not your problem. There are dedicated people who will fix it. You don’t have to pay for water or electricity. All you do is show up.
Sounds great, right?
Well, people who have used the community laundry room may talk about the other side of the coin:
There might not be enough machines for all the residents, leading to long wait times or even conflicts over machine usage.
You have to plan your day around the schedule of what’s available in the laundry room
If there’s a problem with the room or its services, you could call and report an incident, but you’ll most likely lose your time slot and have to rebook another one
If you’re late to pick up your clothes, the next customer may take your laundry out of the washers
Some would have friends and family come and use those machines even though they’re not the ones paying for it
You may need a special service that those machines don’t provide. The functionality of those machines is usually limited to the common denominator of popular demand.
Some might leave behind detergent spills, lint, or even forgotten laundry, which can be inconvenient and unhygienic for others.
The list goes on, but you get the idea.
As a result, many homeowners who can afford it get their own private washing machine at home:
It has an initial cost to buy and install it
It takes space
The owner must pay the cost of electricity, water, and service
If it breaks it’s up to the owner to fix it
The owner must read the manual to understand how it works
The washing machines that are made for private users usually have more programs but less capacity
You probably guessed where I’m going with this. The private washing machine vs community laundry room is not a clear-cut choice. There are trade-offs to be made.
The same concept applies to central platforms vs bespoke platforms, as in the story I told above.
Shadow IT vs centralized IT
In big organizations, shadow IT refers to information technology (IT) systems deployed by departments other than the central IT department, to bypass limitations and restrictions that have been imposed by central information systems. While it can promote innovation and productivity, shadow IT introduces security risks and compliance concerns, especially when such systems are not aligned with corporate governance. —Wikipedia
Efficiency is often considered the main reason to centralize platforms. As someone who has experienced both the “community laundry room” and “private washing machine” I see a multi-variable equation:
Initial cost: the central platforms usually need to accommodate the needs of a larger and more diverse range of applications. Platform engineers are generally more expensive to hire and retain. As a result, many companies burn through their platform budget with consultants who specialize in building platforms. Once the consultants build the platform, you’re usually stuck with a white-label vendor lock-in: you’re going to need the same consultants to run it for you.
Flexibility and control: bespoke platforms enjoy a higher degree of autonomy because their teams don’t have to go through a gatekeeping process for every change they want to make to their platform. This also means more responsibility because when things break, they’re responsible for bringing them back up again. You may remember that mandate and responsibility are two elements of the full ownership model I wrote about before: you build it, you run it.
Maintenance and operation: depending on the industry and offering, the central platform teams take up 20-50% of the engineering workforce. Although you could argue that in a bespoke platform setup, maybe even more engineers work with platform, most of those engineers don’t work with the platform full-time. Part of their time goes to the actual application that is bringing money to the company and that’s a good thing. Not just for efficiency but also effectiveness: due to better understanding of the application needs, they tweak the platform to fit the application better. This is the knowledge element in the full ownership model.
Risk tolerance: if a private washing machine fails, it’s a bad day for the owner but it doesn’t take down the entire community laundry room. Central platforms put a lot of eggs in one basket. Their SLO (service level objective) is set by the most sensitive system that runs on them, and this makes maintenance more costly. You’ve probably been spammed by your platform team trying to upgrade the version of some dependency.
Observability, compliance, and security: it is usually cheaper to audit and secure a central platform. Observability (logs, metrics, traces) is usually more mature when everyone is using the same platform. If you’ve tried tracing across a heterogeneous setup (e.g., Kubernetes, EC2, Elastic, Grafana, etc.) you know what kind of pain I’m talking about. That is because the central platform teams are usually more homogeneous and easier to change. The diversity of bespoke platforms and the lack of visibility into them make them a nightmare to keep secure and compliant. The “you build it, you run it” model imposes a level of cognitive load on the teams owning those platforms that may not be desirable or affordable for the business.
Cognitive load: the central platform teams usually have dedicated infrastructure engineers who do this for a job. Some of them may call themselves DevOps but unless they’re breaking the wall between developers and operations, it’s a lie. I’ve also seen some of them call themselves SREs but unless they spend significant time coding and applying software engineering concepts to operational problems, that’s a lie too. “But Alex, we use IaC (infrastructure as code)”! Yeah, an ostrich has feathers too! 🪶
Isolation: Central platforms are often quick to forget that the company most probably doesn’t make any money from selling the platform! The application developers who build on top of the platform should be treated as valued customers. Unfortunately, the central platform teams optimize for themselves and instead of interacting with the application developers, make them go through some sort of ticketing system (Jira, ServiceNow, etc.). This isolation reduces the evolution speed of the platform because the typical application developer is missing from the platform discussions. The platform team sees themselves as babysitters. (babysitter is one of the 6 archetypes of the broken ownership)
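The risk tolerance point above says a central platform’s SLO is set by the most sensitive system running on it. Here is a minimal Python sketch of that arithmetic. The tenant names and availability targets are made up for illustration, and the 30-day window is an assumption. The only claim is that a tenant cannot be more available than the platform underneath it, so the platform has to aim at least as high as its strictest tenant.

```python
# Hypothetical tenants and availability targets (illustrative numbers only).
tenant_slos = {
    "marketing-site": 0.99,
    "checkout-api": 0.9995,
    "internal-reports": 0.95,
}


def required_platform_slo(slos: dict[str, float]) -> float:
    """A tenant cannot be more available than the platform it runs on,
    so the platform must at least match the strictest tenant target."""
    return max(slos.values())


def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Downtime allowed in the window for a given availability target."""
    return (1 - slo) * days * 24 * 60


platform_slo = required_platform_slo(tenant_slos)
print(f"platform must target at least {platform_slo:.2%}")
print(f"shared error budget: {error_budget_minutes(platform_slo):.1f} min / 30 days")

# Compare what each tenant would tolerate alone with what the shared platform allows.
for name, slo in tenant_slos.items():
    print(f"{name}: own budget {error_budget_minutes(slo):.1f} min / 30 days")
```

With these made-up numbers, the least demanding tenant could tolerate far more downtime on its own than the shared platform is allowed, which is where the extra maintenance cost comes from.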
I could continue this list, but I know you’re busy so let’s dig into when to use what.
When to use what?
Central platforms usually have higher quality and increase compliance, security, observability, and efficiency due to economy of scale.
Bespoke platforms usually have higher flexibility and increase autonomy, resilience, and efficiency due to multi-disciplinary workforce.
Which one is better? It depends on many factors:
Repetitive work: if you’re in the business of creating new services left and right, you can go for a central platform that solves service provisioning once and provides the features to observe, secure and control those services out of the box. It makes service creation and operation cheaper. This kind of investment has the highest ROI (return on investment) when made early.
Diverse workload: if your application workloads are different enough to benefit from a range of different platform offerings, you could probably start with separate platforms for each and then unify when patterns emerge. For example, the backends that serve images for a media site are bandwidth heavy and light on processing. On the other hand, the backends that train an AI model require massive machines with beefy GPUs. The two require different network topologies and storage solutions, and even the deployment tooling and patterns can be different. Is it possible to have one platform offering that’s flexible enough to support any workload? Of course! That’s what Amazon, Microsoft, and Google did. But they’re in the business of selling platforms. You probably are not. If you want to apply the concepts of lean startup, start small and don’t solve tomorrow’s problem on day 1. The way you build an industrial-level washing machine is different from the way you build a consumer-level one. Sometimes you start hand washing before you even need a machine! That’s a respectable option.
Risk tolerance: if your industry is heavily regulated (e.g., healthcare, finance, etc.) or has high security requirements (e.g., defense, infrastructure, etc.), central platforms make it significantly cheaper to be on top of those risks through a combination of gate keeping, access control, dedicated SME (subject matter expert) roles, and tooling. On the other hand, multiple platform teams diversify that risk. One platform failure doesn’t impact the others (the apps running on top of those platforms may do, but that’s another topic. I have a series of posts about application resilience patterns in my book: Reliability Engineering Mindset).
Naïve leadership: originally, I was not going to write this, but it happens too often so I’m going to call it out. If your engineering leadership is naïve enough to sign the hefty budget for a central platform team without asking the right questions, the company pays the ultimate bill for hiring incompetent leaders and putting them in a position of power. If the leadership doesn’t have a realistic picture of the business, risks, architecture and the type of workload, they’re inclined to fall back to what they have seen at a previous job. One of the primary skills of a good leader is to know who to trust and for what. Engineers often do what engineers do: they overengineer. Seniors are worse: they know enough to be dangerous and are dogmatic enough to use the best practice in the wrong place. You think you're saving money through unification, but what you actually do is create friction... at scale!
In an upcoming article I will distinguish between the senior engineer and the mature engineer.
Central platforms feed shadow IT
The downsides of the community laundry room push frustrated people to use their own private washing machine. That is, if they can afford it.
It is a bit counterintuitive that the central platforms are responsible for creating shadow IT! This also increases the cost of the central platform: fewer users mean less reason to justify the cost.
How to avoid unnecessary shadow IT?
The most important reason that teams build their own platform is that the total cost of having their own platform is lower overall than the cost of using a central solution.
I know we talked about washing machines but let me just drop this here:
Internalize the fact that the application developers are closer to the business (how the money is made) and should be treated as VIPs (very important persons):
Invest in DX (developer experience): easier onboarding, better operation tooling, better documentation, more direct service, etc. Get out of their way and let them deliver. That’s the main value proposition for every platform team. When I was at a central platform, I was tasked with setting an SLO (service level objective) for us. While doing research about how our consumers perceived reliability, I came across cases where a developer had to go through seven infrastructure teams 🤯 to get something as simple as adding a new endpoint done. We were optimized for us, not for our consumers. And the company paid the final bill due to poor developer productivity.
Trust the devs: Developers are smart. Trust me, I’ve worked both as a developer and as an SRE. I had more responsibility as an SRE, but my brain worked harder when I was a developer. Don’t act like a babysitter. Give them access when they need it and educate them when they require it. Your primary objective should be to empower developers and deprecate yourself. You should never optimize for keeping your own job by holding the keys and keeping knowledge away. That’s an effective way to make sure you don’t grow beyond your limited knowledge. Treat developers with respect and they’ll lift your level too.
Frictionless support: A common anti-pattern is ticketing. It acts as a wall between your platform team and the developers. If you have ticketing, you can be certain that you haven’t hired proper DevOps because, as mentioned, breaking the wall between devs and operations is one of the main DevOps principles. If you treat your colleagues like external customers, you haven’t understood the nuances of a product platform.
Platform as a product: treat your platform as a product with a properly technical product manager. Your consumers and stakeholders are internal but that doesn’t mean you should pretend that they don’t exist. As a rule of thumb, the closer you are to the business customers (e.g., end users), the more likely you can create a product market fit. Of course, they are going to come to you with edge case requests and of course your product manager is empowered to say “no”. But remember that every unjust “no” is a step to drive them to work around your entire team and build their own platform.
Never play the mandate card: when you spot a bespoke platform solution, it is tempting to fight it, escalate and force them to use your platform but let’s face it, no one likes to be forced to use a product, even in communist countries! Treat the existence of a bespoke platform as an incident. It’s a symptom. The real root cause is most probably in your central platform offering. Try to learn from it and create a product that’s so good they can’t resist paying the migration price. That’s the bar you should be aiming for. Not holding them at gunpoint to migrate. Because the product may eventually migrate, but not people’s hearts.
Be quick: when application developers have a request, it is put into a backlog and weighed against every other request from across the organization. Sometimes it leads to a back-and-forth discussion where the developers eventually have to refactor or work around what the central platform lacks. Sometimes it does get implemented, but by the time the central platform is ready to ship it at its own pace, the developers might have moved on to their own alternative solution.
Remember: platform is not rocket science.
Developers are smart. For your platform to be a viable product it needs to be cheaper than a bespoke solution. Be service minded.