What is it? How does it work? When to use it and when not to use it?
Fallback and failover are sometimes confused with each other.
It is easy to see why:
They are both F-words! 😄 The words “fail” and “fall” are conceptually related.
They both rely on some predefined backup plan.
They are both strategies to reduce the risk of failures.
They both increase the system complexity and cost.
You may use one or both in a given system architecture.
We need to distinguish between relevant terms because what you think may not be what you say, what you say may not be what is heard, and what is heard may not be what is implemented!
In a previous article we discussed failover.
This article digs into different types of fallback mechanisms, how they work, and some examples.
We’ll dig into other terms like roll back, fix forward, blue-green, canary, etc. in upcoming articles.
What is fallback?
Fallback is a risk mitigation strategy for reducing the negative impact of failures in a cost-effective manner.
Where failover uses the same type of solution to achieve the same outcome, fallback uses a different type of solution to maintain the essential system functionality.
How does fallback work?
In a nutshell here’s how fallback works:
The primary system is responsible for handling the load while the fallback system is ready to take over when needed. The fallback system is usually a different type than the primary system and we will discuss why.
If the primary system fails, the load is shifted to the fallback system. The detection and load switching can either be manual or automatic.
Fallback buys some time to fix the issue with the primary system. The usage of fallback may create some manual work that we must deal with afterwards (see the examples).
Once the primary system is back up again, it’ll take over handling the load.
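The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the switch happens per request and is triggered by an exception; the handler names (`full_search`, `basic_search`) are made up for the example.

```python
from typing import Callable, Tuple


def handle_with_fallback(
    request: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, bool]:
    """Try the primary first; if it fails, shift this request to the fallback.

    Returns (response, degraded) so the caller can log that the fallback
    was used and that follow-up work may be needed.
    """
    try:
        return primary(request), False
    except Exception:
        # The primary failed: serve a degraded answer instead of an error.
        return fallback(request), True


# Hypothetical handlers, for illustration only.
def full_search(query: str) -> str:
    raise ConnectionError("search cluster unreachable")


def basic_search(query: str) -> str:
    return f"basic results for {query!r}"


response, degraded = handle_with_fallback("fallback", full_search, basic_search)
```

Because every request tries the primary first, the primary takes over again as soon as it is healthy. The `degraded` flag is where the "manual work afterwards" would hook in, for example by writing to a log that someone reviews later.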
A common side effect of this mitigation strategy is that the end user may notice a difference depending on whether the primary or the fallback system is in charge. This contrasts with failover, which aims for seamless mitigation from the users’ point of view.
However, fallback is usually cheaper, easier to implement and sometimes the only viable option.
As we’ll see in the examples, it is common for fallback to be combined with failover for maximum service continuity.
Key characteristics of fallback
Service degradation: Fallback relies on an alternative mechanism that covers only the essential system functionality, so the service level drops while it is in use. But having something is better than nothing.
Cost effectiveness: Fallback is not meant to be used often, so it is fine for it to have a poorer performance/cost ratio than the primary system while it is in use.
Toil: Unlike failover which heavily relies on automation, it is fine for fallback to involve some level of manual work for detecting the failure, switching the load, and recovering from failure.
Not used that often: due to manual effort and degraded service level, fallback is not to be used regularly. It exists to improve the service level in the rare case of an emergency.
There are of course products that rely heavily on fallback due to cost efficiency or simplicity. It might be totally fine if high availability or seamless experience is not a requirement.
On the other hand, there may be cases of over-engineering when the value of higher reliability is not justified by the cost of implementing a fallback or failover.
Types of fallback mechanisms
Limited Functionality: The fallback mechanism provides a reduced set of features or functionality compared to the primary system. It allows the system to continue operating with the essential feature set until the primary system is restored. Example: Windows Safe Mode.
Poor Data: The fallback serves stale or incomplete data, for instance from a cache or a secondary database that only contains the most important information. Example: A mobile app using cached data when the backend is not accessible.
Lower Performance: The fallback system operates at a lower performance level compared to the primary system. This may result in slower response times, decreased throughput, or other performance limitations. However, it still provides the essential functionality. Example: serving from the backends if the cache server dies.
Higher Cost: Fallback is cost-effective but only when comparing the cost of total failure against the cost of running an alternative solution. It may still cost more than the primary solution when we factor in the system complexity, runtime cost, redundant resources, and maintenance. Example: using a different payment provider with a higher fee.
Note: The maintenance banner is not a fallback system because the service is not usable. Example: Apple famously shuts down their website prior to big announcements. During these maintenance windows, users cannot buy products which in Apple’s eyes is a reasonable risk to take.
Homogeneous Fallback: The fallback system is the same type or even sometimes the same instance as the primary system. The main difference is the degradation. For example, a mobile app may fall back to offline mode if the backend is not available or accessible. This may hurt the user experience but is better than no experience at all (depending on the purpose of the app).
Heterogeneous Fallback: This type of fallback mechanism involves using a different type of solution or technology to fulfill the essential functionality. The secondary system may not have the exact same capabilities as the primary system, but it can still provide an acceptable level of service.
Change is the number one enemy of reliability. If we decouple the lifecycle of the two alternative solutions, we can improve reliability.
If the primary system fails due to some faulty code commit, bad data or misconfiguration, a heterogeneous fallback is more likely to be ready than a homogeneous fallback with its lifecycle tied to the primary system. Non-simultaneous failure translates to more reliability.
One straightforward way to make a homogeneous fallback system more robust is to decouple its lifecycle from the primary system. For example: keep a deprecated version of the service around. This solution is useful for high-risk situations like when a service is entirely rewritten or rearchitected and is just being released to production. By keeping the old service around, we can quickly shift the load if things go south.
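Keeping the old service around is only useful if the load can be shifted quickly. A minimal sketch of such a switch, assuming an operator flips it manually; the `VersionRouter` class and the two handlers are hypothetical and only illustrate the idea:

```python
from typing import Callable, Dict


class VersionRouter:
    """Sends requests to the new service unless an operator flips the switch."""

    def __init__(self, new: Callable[[str], str], old: Callable[[str], str]) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {"new": new, "old": old}
        self.active = "new"

    def shift_to_old(self) -> None:
        # Manual fallback: things went south with the rewrite.
        self.active = "old"

    def shift_to_new(self) -> None:
        # The issue is fixed: hand the load back to the primary.
        self.active = "new"

    def handle(self, request: str) -> str:
        return self._handlers[self.active](request)


# Hypothetical services, for illustration only.
router = VersionRouter(new=lambda r: f"v2:{r}", old=lambda r: f"v1:{r}")
before = router.handle("order-42")
router.shift_to_old()
after = router.handle("order-42")
```

In a real system the switch would more likely live in a load balancer or a feature flag than in application code, but the principle is the same: the old version has its own lifecycle and is one flip away.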
As discussed for failover, there are 3 critical points where automation can improve reliability for what would otherwise be manual:
Detecting failure in the primary system
Switching Load to the fallback system (and back to the primary system once the issue is resolved)
Failure resolution and recovery
However, in the case of fallback, the cost of automation may outweigh its value because:
Fallback is to be used as a last resort due to its service degradation aspect. The effort that goes into automation would be better spent in improving reliability of the primary system. The ROI (return on investment) in automation may be low, depending on the risk assessment and probability.
Since fallback often uses an entirely different technology, architecture, or functionality, automation may require skills and resources that are very different from those needed for the primary system. Sometimes it’s just cheaper and more accurate to have a human in the loop.
Fallback’s rare usage also means that any automation in place is put to the test less often. It may easily lead to false confidence where we think it’s going to work but it may fail miserably. The workaround is to use fallback every now and then. Pretty much like how the fire alarms are tested regularly! 😀 If you have something like Chaos Monkey for chaos engineering, the risk of surprise failure in the fallback system can be lower.
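If you do decide to automate, the first two points (detection and load switching) can be as simple as counting health check results. Here is a rough sketch, assuming some external probe feeds it one result at a time; the class name and the thresholds are made up for illustration:

```python
class FailureDetector:
    """Decides when to shift load, based on consecutive health check results."""

    def __init__(self, failure_threshold: int = 3, recovery_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.using_fallback = False
        self._failures = 0
        self._successes = 0

    def record(self, healthy: bool) -> bool:
        """Feed one health check result; returns True while the fallback should serve."""
        if healthy:
            self._failures = 0
            self._successes += 1
            if self.using_fallback and self._successes >= self.recovery_threshold:
                self.using_fallback = False  # primary looks stable again
        else:
            self._successes = 0
            self._failures += 1
            if not self.using_fallback and self._failures >= self.failure_threshold:
                self.using_fallback = True  # too many failures in a row
        return self.using_fallback


detector = FailureDetector(failure_threshold=3, recovery_threshold=2)
checks = [False, False, True, False, False, False, True, True]
decisions = [detector.record(healthy) for healthy in checks]
```

Requiring several consecutive results in both directions avoids flapping between the two systems on a single blip. It does nothing for the third point (resolving the failure), which usually stays with a human.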
Take the humble Windows PC as an example. If Windows fails, one can manually boot it into safe mode. The failure detection is often manual but can be automatic. The UX is limited but it’s better than a dead PC.
Once I was working on a news site. There was a cache/reverse proxy (Varnish) between our servers and the users’ browser.
Normally, the cache server would keep the content from the backend for a maximum of 5 minutes.
This is due to the nature of the application: news should be fresh, including the edits that the editors make to a news article after it is published, but it still needs to be cached to reduce the load on the backend from thousands of requests per second.
When the backend failed, the cache server would serve stale content to the users.
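The idea of serving stale content can be sketched like this. It is a Python illustration of the behaviour, not the Varnish configuration we actually ran; the class, the simulated clock and the `fetch_article` backend are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serves fresh entries within the TTL, and stale ones if the backend fails."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,  # the 5 minutes mentioned above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # stale, but better than an error page
            raise  # nothing cached: there is no fallback for this key
        self._entries[key] = (now, value)
        return value


# Simulated clock and backend, for illustration only.
now = [0.0]
backend_up = [True]


def fetch_article(path: str) -> str:
    if not backend_up[0]:
        raise ConnectionError("backend down")
    return f"article {path} rendered at t={now[0]:.0f}"


cache = StaleOnErrorCache(fetch_article, clock=lambda: now[0])
first = cache.get("/news/1")   # fetched from the backend
now[0] = 400.0                 # the TTL has expired...
backend_up[0] = False          # ...and the backend is down
stale = cache.get("/news/1")   # the stale copy is served instead
```

Note that a page that was never cached still fails, which is one reason this kind of fallback works best for popular content.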
We also had a second fallback mechanism. Some articles were paywalled to generate revenue for the company. The paywall feature depended on the interface with another team’s product. If their product failed, the cache server would automatically bypass the paywall feature exposing the content for “free” until the issue was fixed.
The company would lose potential revenue from the content, but it was an accepted risk in comparison to taking the site down or showing an unpleasant error:
You can’t read this news article because we cannot verify if you’ve paid or not!
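This "fail open" decision boils down to a few lines. The sketch below puts the logic in application code for readability, whereas in our case it lived in the cache server; the article fields and the `has_paid` check are hypothetical.

```python
from typing import Callable


def render_article(article: dict, user_id: str, has_paid: Callable[[str], bool]) -> str:
    """Returns the full body, or the teaser for users known not to have paid."""
    if not article["paywalled"]:
        return article["body"]
    try:
        paid = has_paid(user_id)
    except Exception:
        # The paywall dependency is down: fail open and accept the lost revenue.
        paid = True
    return article["body"] if paid else article["teaser"]


# Hypothetical data and dependency, for illustration only.
article = {"paywalled": True, "body": "Full story", "teaser": "Subscribe to read more"}


def entitlement_down(user_id: str) -> bool:
    raise TimeoutError("entitlement service timed out")


shown_when_down = render_article(article, "user-1", entitlement_down)
shown_to_non_payer = render_article(article, "user-1", lambda _uid: False)
```

Whether to fail open or fail closed is a business decision. For a news paywall, failing open made sense; for something like access control, it would not.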
With these tricks we managed to have an extremely high availability service level, even though the systems that the cache server depended upon were not as available.
One company sold premium products online. The sales volume was not at eBay or Amazon level, but each order was extremely important for the revenue. We couldn’t afford to lose an order.
The fallback mechanism was to manually call the potential customer in case of a failure. This was viable because we did our best to obtain the contact information of the potential customers as soon as possible.
The extra manual work added to the cost of selling the product, but it was deemed acceptable due to the nature of the product.
In practice, this fallback increased the availability of the order flow to close to 100% even though the online shop’s availability was below 98%.
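A back-of-the-envelope calculation shows why a fallback can lift availability this much. It assumes the primary and the fallback fail independently, which is more plausible for a heterogeneous fallback like a phone call. The 0.95 figure for the manual fallback is invented for the example, not a measured number.

```python
def combined_availability(primary: float, fallback: float) -> float:
    """Probability that at least one of the two paths works,
    assuming their failures are independent."""
    return 1.0 - (1.0 - primary) * (1.0 - fallback)


# 0.98 is the upper bound for the online shop from the text;
# 0.95 for the phone call fallback is a made-up figure.
order_flow = combined_availability(0.98, 0.95)  # roughly 0.999
```

Even a fairly unreliable fallback helps a lot, because both paths have to fail at the same time for an order to be lost.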
In their 2011 article Making the Netflix API More Resilient, Ben Schmaus writes about circuit breakers enabling 3 solutions:
Fail fast: there’s no good fallback mechanism and the UX is impacted. Although not ideal, it buys time for the API servers to recover quickly.
Fail silent: the API simply returns a null value for optional and non-essential values.
Custom fallback: used when some data is available locally (e.g., a cookie or local JVM cache) to generate a fallback response without having to make a call to a failed service or database
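To make the three options concrete, here is a heavily simplified circuit breaker in Python. It is a sketch written for this article, not Netflix's implementation; the class, its parameters and the `flaky_ratings` service are assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the circuit is open and no fallback is configured."""


class CircuitBreaker:
    def __init__(
        self,
        call: Callable[[str], object],
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        fallback: Optional[Callable[[str], object]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._call = call
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._fallback = fallback
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after:
            # Cool-down is over: close the circuit and try the service again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def _degrade(self, request: str, error: Exception) -> object:
        if self._fallback is not None:
            return self._fallback(request)  # fail silent or custom fallback
        raise error  # fail fast

    def __call__(self, request: str) -> object:
        if self._is_open():
            # Do not even touch the failing service while the circuit is open.
            return self._degrade(request, CircuitOpenError("circuit open"))
        try:
            result = self._call(request)
        except Exception as exc:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return self._degrade(request, exc)
        self._failures = 0
        return result


# Hypothetical failing dependency, for illustration only.
calls: List[str] = []


def flaky_ratings(title: str) -> int:
    calls.append(title)
    raise TimeoutError("ratings service timed out")


fail_silent = CircuitBreaker(flaky_ratings, failure_threshold=2, fallback=lambda _title: None)
results = [fail_silent("t1"), fail_silent("t2"), fail_silent("t3")]
```

With `fallback=None` the breaker fails fast by raising. With a fallback that returns `None` it fails silent, as in the usage above. A custom fallback would be a function that builds the response from local data. After two failures the circuit opens, so the third request never reaches the failing service.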
We will dig more into circuit breakers, retry logic, etc. in the upcoming articles about resilient architectural patterns.
When to use Fallback?
If the primary system doesn’t fail often but you still need the guarantee to have something up and running if it ever fails.
If you can offer reduced functionality in a cheaper way that is worth it.
When failover is already in place, but you still need higher reliability.
When failover is not feasible or cost effective, but an alternative solution is feasible even with the involvement of manual work to detect, switch load, or resolve the failure.
When Recovery Time Objective (RTO) is flexible: If the system can tolerate some downtime or if the time to recover from a failure is not critical, fallback can be a viable option.
If service degradation is viable: Fallback can be suitable when the system can gracefully degrade while still providing essential services.
Fallback is a cost-effective risk mitigation strategy for service continuity. In some cases, it can be used instead of failover and in others it may complement failover.
Fallback usually uses a different solution to provide core functionalities. However, it is a tradeoff between degraded service, complexity, and potentially manual work against the cost of failure. Depending on the type of application, revenue model and non-functional requirements, sometimes it is acceptable to just fail the users rather than mitigate using fallback or failover.
Due to service degradation, fallback is only to be used as a measure of last resort. It is not to be used too often. “Often” depends on the type of system, how reliability is perceived and the cost/benefit equation for having a more solid service level.
To reduce the negative impact of fallback on the consumers, you can use automation for detection, load switching and recovery.
This article took about 12 hours to research, draft, edit and illustrate.