We invested 10% to pay back tech debt; here's what happened
Why and how we continuously invested team bandwidth in paying back tech debt, and what the results were.
Anyone who has maintained software for a while knows that it tends to rot over time. It takes deliberate effort to prevent that from happening. In this post I will tell a story about how one team successfully dealt with it and conclude with some practical tips.
Rot shows up in several ways:
Decreasing MTBF (mean time between failures): the software fails more often and incidents become more frequent.
Increasing LT (lead time): for features that have similar user value, the time it takes to implement, review, deploy and release them increases over time.
Decreased efficiency: the ratio of value to effort drops.
Increasing TTR (time to repair or remedy): it takes longer to fix a software defect (repair) and to ensure it does not happen again (remedy). (see my article on InfoQ about MTT metrics)
Increasing TTFC (time to first commit): one of several metrics that aim to measure the effectiveness of onboarding a new person to the codebase.
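To make two of these metrics concrete, here is a minimal sketch of how MTBF and TTR could be computed from an incident log. The incident data and function names are made up for illustration, and the definitions are assumptions: TTR is measured from failure start to restoration, and MTBF as the uptime between one restoration and the next failure. Your team may define them differently.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: (failure started, service restored).
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 11, 9, 0), datetime(2024, 1, 11, 11, 0)),
    (datetime(2024, 1, 16, 9, 0), datetime(2024, 1, 16, 12, 0)),
]


def mean_time_to_repair(incidents):
    """Average duration from failure start to restoration."""
    durations = [(end - start).total_seconds() for start, end in incidents]
    return timedelta(seconds=mean(durations))


def mean_time_between_failures(incidents):
    """Average uptime between a restoration and the next failure."""
    ordered = sorted(incidents)
    gaps = [
        (next_start - prev_end).total_seconds()
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=mean(gaps))


print("TTR: ", mean_time_to_repair(incidents))
print("MTBF:", mean_time_between_failures(incidents))
```

Computing these per quarter and watching the trend (MTBF going down, TTR going up) is one way to spot the rot described above.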
The root causes are generally:
External: the runtime, operating system and dependencies change over time and require the owners to adapt.
Internal: bugs, config drift, tech debt.
Hybrid: the requirements and user demands change faster than the team can satisfy them with the code in hand.
Of all those causes, tech debt is one that a development team can control.
I will not bore you with what is already on the internet on the topic:
Martin Fowler named 4 types of tech debt
- wrote about the pragmatic middle ground back in 2020
David Pereira wrote about the PM’s take on it
Devopedia has a page about the history of the term
Wikipedia has a list of causes
Instead, I will tell a story of one of the most successful ways to deal with it and conclude with some practical tips.
Years ago, I was collaborating with a team of 12 engineers behind two large full stack applications. Each app had more than 180K SLOC (source lines of code excluding dependencies, comments and empty lines but including the tests).
The code itself was the result of platformization of a bespoke solution that was built a few years back. At some point, the company had multiple solutions for solving the same problem. So, it reasonably decided to pick the most mature solution, generalize it to a platform, and assemble a team of A-players to own it.
That’s where I came in. I was an internal transfer from another cluster. From the TDP trio (tech, domain, people), I was familiar with the tech and people but relatively new to the domain.
My struggles started from day one. I could not make sense of the code base and felt frustrated. At the time I had 19 years of programming experience, the last seven of which were specifically on the technology those apps were using.
Almost everyone in the team had less experience than me (at least on paper), yet a simple task would take me multiple days longer than I thought. I felt dumb and helpless.
Fortunately, some of the creators of the original codebase were with the platform team and could give me a grand total of 2 hours intro. More than helping me understand the code, the intro helped me understand the history, mentality and the larger forces at play which shaped the code.
You see, the leadership did not care about the code quality as long as the stories were delivered on time. Corners were cut, tests were skipped, and I kid you not, there was a sign on the wall that read:
I kept my feelings to myself. Obviously, the guy who asked me to join the team (one of the senior directors in that cluster) had other plans. Maybe it was a test to see how I would react? I was new to the team and had to build credibility before I could steer any change. Plus, as I often say: “Understand before trying to change.” For all I knew, the code and the people were inseparable. You cannot fix cultural issues with technical solutions.
My first real contribution to the team was to put this witty picture on the wall. It was received well.
If it were today, I would put this on the wall:
Have you heard of the broken windows theory? In the context of software maintainability, it means: the more tech debt there is, the less care is given to new development.
In other words, the negative effect of tech debt compounds over time.
Turns out I was not alone in my frustration. Tech debt consistently kept coming up retro after retro until the management decided to take it seriously and do something about it.
So, we had a workshop to drill deeper into this issue: understand why it is happening and how we can take control. The team had an honest conversation and my respect for the team grew. Turns out, their stressful days did not leave much time for cleaning up the mess. Who would have thought? To their credit, I came in when the code was like a crumbling Jenga tower:
They were well aware of the issue, but the main problems were a lack of time and a lack of knowledge of best practices.
I am paraphrasing here since it was a few years ago but if I recall correctly, one developer said:
There is so much tech debt that we should park all regular activities and go fix that for six months.
But we can’t do that. Who is going to run the product and add new features while we’re paying the tech debt? How about breaking the work into smaller parts and gradually doing it over time, in parallel with regular tickets?
Another developer said:
If I am going to clean up the code, I need dedicated time that is not planned for regular work like bug fix or features.
Another developer said:
It would be nice if we could collaborate on the cleanup. That way we can put our minds together to find the best approach and divide the mechanical work. It is not fair for one person to do the cleanup. There’s also a risk that if one brain is tasked to fix tech debt, we may end up creating more of it.
The EM asked:
We need to timebox that activity so it does not swallow the time that should go to features and bug fixes.
How much time do you think is fair to spend on fixing tech debt?
This started a longer discussion, but the consensus was 1 day a week which translates to 20% of the team’s bandwidth.
The PM asked:
So, you are telling me that we must spend 20% of our time just keeping the lights on?
This was followed by a bit of an awkward silence, as if the question were: “Would you go down 20% in your salary?”
The PM continued:
We do have a backlog to deliver so we need a trade-off that balances the two types of tasks. How about 10%?
And the rest is history. The “Tech Debt Friday” was born. Why Friday? I do not remember, but it had something to do with the fact that some people were off on Friday so in practice, tech debt would not “steal” 10% sharp. Still a victory! ✌️
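For the record, the arithmetic behind those percentages can be sketched in a few lines. The function name and the number of engineers off on a given Friday are my own assumptions for illustration; only the team size of 12 and the one-day cadence come from the story.

```python
def tech_debt_share(engineers_present, team_size, debt_days, working_days):
    """Fraction of the team's total bandwidth spent on tech debt in one cycle."""
    return (engineers_present * debt_days) / (team_size * working_days)


TEAM_SIZE = 12

# The engineers' ask: one day out of every 5-day week.
weekly = tech_debt_share(TEAM_SIZE, TEAM_SIZE, debt_days=1, working_days=5)

# The compromise: one Friday out of every 10 working days.
biweekly = tech_debt_share(TEAM_SIZE, TEAM_SIZE, debt_days=1, working_days=10)

# Assumption: 3 of the 12 engineers happen to be off on that Friday.
with_absences = tech_debt_share(9, TEAM_SIZE, debt_days=1, working_days=10)

print(f"{weekly:.1%}  {biweekly:.1%}  {with_absences:.1%}")
```

With a quarter of the team away, the real cost to the backlog drops from 10% to 7.5%, which is why Friday was an easier sell.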
It took a few iterations to mature the process. But it stayed unchanged for the last year I was with that team. Even when the EM and PM changed, the team successfully onboarded the new managers to this “tradition.”
Every other week, we had the "Tech Debt Friday". These days were not planned for a specific issue or story. We had a one-page policy that I forgot to copy, but it went something like this:
We spend 10% of our time dealing with tech debt.
The first rule is not to create tech debt in the first place.
The PR (Pull Request) that creates tech debt should come paired with the issue to deal with it.
All Tech Debt work is recorded as issues and labeled “tech-debt.”
We deal with tech debt at the same time. Keep that day light on meetings.
It is recommended (but optional) to demo the results in the next team demo.
When changing code to fix tech debt, add/update the tests and documentation.
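We enforced these rules by convention, but the "PR comes paired with an issue" rule is easy to automate. Below is a rough sketch of what such a check could look like, not what we actually ran. The `introduces-tech-debt` PR label, the function name and the way issue numbers are passed in are all hypothetical; in practice the data would come from your code hosting platform.

```python
import re

# Matches issue references such as "#123" in a PR description.
ISSUE_REF = re.compile(r"#(\d+)")

# Hypothetical label a developer puts on a PR that knowingly cuts a corner.
DEBT_PR_LABEL = "introduces-tech-debt"


def policy_violations(pr_labels, pr_description, tech_debt_issue_numbers):
    """Return a list of policy violations for a single pull request.

    tech_debt_issue_numbers: numbers of open issues labeled "tech-debt".
    """
    if DEBT_PR_LABEL not in pr_labels:
        return []  # The PR does not claim to create debt; nothing to check.

    referenced = {int(number) for number in ISSUE_REF.findall(pr_description)}
    if not referenced & set(tech_debt_issue_numbers):
        return ['PR creates tech debt but links no issue labeled "tech-debt"']
    return []


# Example: issue #42 is labeled "tech-debt", issue #7 is a regular bug.
open_debt_issues = {42}
print(policy_violations({DEBT_PR_LABEL}, "Quick fix, cleanup in #42", open_debt_issues))
print(policy_violations({DEBT_PR_LABEL}, "Quick fix, related to #7", open_debt_issues))
```

A check like this only works if people are honest about labeling their PRs, so it supports the culture rather than replacing it.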
Engineers looked forward to the Tech Debt Friday. The team would happily remind management that this day cannot (under any circumstances) be planned for regular feature/bugfix work. Although we fixed some bugs along the way, this was primarily an investment to make future feature development cheaper while improving the maintainability and reliability.
Initially it was hard to defend spending 10% of the team bandwidth on tech debt, but over time the payback was huge:
We amortized the debt as soon as we could. Typically, in less than 10 days.
Because issues and stories were not formally assigned on these days, they turned into some of my favorite mob programming sessions. We investigated the code base together and learned about its history.
It turned out that some apparent tech debt was actually code better left untouched; it only looked like debt because it lacked documentation. We documented what we would not refactor or remove.
Gradually, we started to game the system. Having to regularly deal with the debt made us more conscious not to create it in the first place. That way, we would spend the Tech Debt Friday on more meaningful work like improving our tests, linting or CI/CD pipeline to prevent errors or make them cheaper.
Having dealt with tech debt in a collaborative manner enabled us to do the “regular work” faster because we had a better collective understanding of the code, and the code was cleaner to work with. One could argue this is just a positive effect of mob programming, but the lack of a concrete agenda also helped the autonomy that unlocked creativity.
Better clarity on the design and architecture of the code enabled us to make better judgement calls when we had to cut corners due to time constraints. We had a better idea of what it takes to deal with the tech debt and whether it is worth borrowing time from the future to ship faster.
The management gradually started to like this practice too because tech debt would not steal from regular work or cause embarrassingly unnecessary incidents. Plus, this freedom and trust boosted the team spirit. The team was treated like adults, so it behaved accordingly.
Other teams in the company started experimenting with the Tech Debt Friday.
Just because it works doesn’t mean that it’s done properly. The problem with code is that its structure is hard to visualize in a way that is approachable to the decision-makers (e.g. the PM), which makes it hard to get the time to fix it. How about this?
Tech debt is like fast food. It is ok when you have no better choice, but one needs to do the workout to reduce its negative side effects. (this sentence is stolen from my own guiding principles)
On that line of argument "code obesity" should totally be a thing! 😄
Tech debt is "short-sightedness ransom" because only tactical and short-term thinking ignores it. More on the difference between tactical and strategic thinking here.
Robert Kiyosaki (the author of Rich Dad Poor Dad) famously said:
Bad debt takes money out of your pocket. Good debt puts money in your pocket.
There is good tech debt and bad tech debt:
Good tech debt is a deliberate and conscious trade-off to get the results out. It is a great tool to accelerate discovery and is paid back ASAP.
Bad tech debt, which is the kind commonly associated with negativity, is about postponing work out of laziness and for no clear gain. It is like borrowing money to buy a pair of expensive shoes just to feel good!
If you cannot pay for it, do not take it. Two big reasons behind unpaid tech debt:
Poor engineering: underestimating the work required to create a maintainable product.
Poor leadership: if engineers must sell the necessity of paying back tech debt to a gatekeeper, you need to ask why the gatekeeper exists at all.
Alas, poor engineers and poor leaders fit together like a door and its frame. It is hard to break such a symbiotic relationship. It may be easier to abandon ship and join a team which cares about software sustainability.
These posts take anything from a few hours to days to write. I am experimenting with an idea to monetize this time without being an a**. The main chunk of the article is available for free. For those who spare a few bucks, the “pro tips” section is a token of appreciation. And for those who choose not to subscribe, it is fine too. I appreciate your time reading this far. 🙏 You can also follow me on LinkedIn where I share useful tips about technical leadership and growth mindset.