Modern cloud-based, microservice-based, and Internet-of-Things (IoT) applications frequently depend on communicating with other systems across an unreliable network. Those systems can become unavailable or unreachable due to transient faults such as network problems and timeouts, or because they are offline, under load, or unresponsive.
Polly, a .NET resilience and transient-fault-handling library, offers multiple resilience policies that enable software architects to design reactive strategies for handling transient faults, as well as proactive strategies for promoting resilience and stability. In this post, I will walk you through the policies the Polly library offers for handling transient faults.
Reactive transient fault handling approaches
These short-term faults typically correct themselves after a short span of time, and a robust cloud application should be prepared to deal with them by using a strategy like the Retry pattern.
Retry allows callers to reattempt operations in the expectation that many faults are short-lived and self-correcting; the retried operation may succeed, perhaps after a short delay.
Waiting between retries gives faults time to self-correct. Practices such as exponential backoff and jitter refine this by scheduling retries so that they do not themselves become sources of further load or spikes.
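As a sketch, a Polly (v7-style syntax) retry policy with exponential backoff and jitter might look like the following; the retry count, delays, `httpClient` instance, and URL are illustrative assumptions, not prescriptions:

```csharp
using System;
using System.Net.Http;
using Polly;

// Retry up to 3 times, waiting 2^attempt seconds between tries (exponential
// backoff), plus a small random jitter so that many clients retrying at once
// do not all hit the faulted system in lock-step.
var jitterer = new Random();
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt))
            + TimeSpan.FromMilliseconds(jitterer.Next(0, 100)));

// The policy wraps the unreliable call:
var response = await retryPolicy.ExecuteAsync(
    () => httpClient.GetAsync("https://example.com/api/values"));
```

The jitter here is deliberately small relative to the backoff; its job is only to de-synchronise retrying clients, not to add meaningful delay.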
There can also be circumstances where faults are caused by unexpected events that may take much longer to fix themselves. In these situations, it may be useless for an application to continually retry an operation that is unlikely to succeed. Instead, the application should be coded to accept that the operation has failed and handle the failure accordingly.
Using HTTP retries in these situations could effectively create a Denial of Service (DoS) attack within your own software. You therefore need a defence barrier so that excessive requests stop when it is no longer worth continuing to try. That defence barrier is the Circuit Breaker.
How does a circuit breaker work?
A circuit breaker monitors the level of faults in calls placed through it and blocks further calls when a configured fault threshold is reached.
The circuit breaker moves through the following state transitions:
- Closed to Open
- Open to Half-Open
- Half-Open to Open
- Half-Open to Closed
When faults exceed the threshold, the circuit breaks (opens). A call placed through an open circuit is not actioned at all; instead it fails immediately with an exception. This protects the faulting system from extra load, and allows the calling system to avoid placing calls that are unlikely to succeed. Failing instantly in this scenario usually also promotes a better user experience.
After a configured time has elapsed, the circuit moves to a half-open state, where the next call is treated as a trial of the faulted system's health. Based on this trial call, the breaker decides whether to close the circuit (resume normal operation) or break it again.
Note: The circuit breaker in software systems is analogous to the one in electrical wiring; substantial faults 'trip' the circuit, protecting the systems it regulates.
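A minimal Polly circuit-breaker sketch (v7-style syntax) might look like this; the fault count, break duration, `httpClient` instance, and URL are illustrative assumptions:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Break the circuit after 5 consecutive handled faults, and keep it open
// for 30 seconds before moving to the half-open (trial) state.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30));

try
{
    await breaker.ExecuteAsync(
        () => httpClient.GetAsync("https://example.com/api/values"));
}
catch (BrokenCircuitException)
{
    // The circuit is open: this call was rejected immediately,
    // without being attempted against the downstream system at all.
}
```

Catching `BrokenCircuitException` is the point at which "failing instantly" becomes visible to the caller, which can then fall back or report gracefully.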
A Fallback policy defines how the operation should react if, even with retries – or because of a broken circuit – the underlying operation still fails. However well resilience-engineered your system is, failures will occur. Fallback means defining a substitute: what you will do when that happens. It is a plan for failure, rather than leaving failure to have unpredictable effects on your system.
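A fallback sketch in Polly (v7-style syntax) could look like the following; the stubbed response body, `httpClient` instance, and URL are illustrative assumptions:

```csharp
using System;
using System.Net;
using System.Net.Http;
using Polly;

// If the underlying call still fails, return a stubbed response instead
// of letting the failure propagate unpredictably through the system.
var fallback = Policy<HttpResponseMessage>
    .Handle<Exception>()
    .FallbackAsync(
        fallbackValue: new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("{\"source\":\"fallback\"}")
        });

var response = await fallback.ExecuteAsync(
    () => httpClient.GetAsync("https://example.com/api/values"));
```

A fallback value might equally be a cached last-known-good result or a call to a secondary service; the key design choice is that the substitute behaviour is decided up front, not improvised at failure time.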
Proactive transient fault handling approaches
Retry and Circuit-Breaker are the primary approaches for resilience to transient faults; however, both are reactive, in that they act only after a failure response to a call has been received.
But what happens if:
- The response never comes
- The response is so delayed that we do not wish to continue waiting
- The waiting-to-react logic itself has problems
To handle these questions, can we be more proactive in our approach to resource management and resilience?
Consider a scenario in which, in a high-throughput system, many calls can be placed through to a recently failed system before the first timeout is received. For illustration (Example one), with 100 calls/second to a faulted system configured with a 10-second timeout, 1000 calls could have been placed before the first timeout is received. A circuit breaker will react to this situation as soon as the defined failure threshold is reached, but until then a resource bulge has already occurred and cannot be undone.
While Retry and Circuit-Breaker are reactive, the Timeout, Bulkhead, and Caching policies allow pre-emptive, proactive strategies. High-throughput systems can achieve increased resilience by explicitly managing load for stability. We will walk through each of these in the sections below.
Timeout allows callers to walk away from an awaiting call. It improves resilience by setting callers free when a response seems unlikely to come.
Moreover, as the 'Example one' figures above demonstrate, opting for timeouts can significantly reduce resource consumption in a faulting, high-throughput system.
Such scenarios often lead to blocked threads or connections, and to memory consumed by the awaiting calls, which in turn causes further failures. Consider how long you want to let your awaiting calls consume these costly system resources.
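A Polly timeout sketch (v7-style syntax) might look like this; the 3-second limit, pessimistic strategy, `httpClient` instance, and URL are illustrative assumptions:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using Polly;
using Polly.Timeout;

// Give up on any call still awaiting after 3 seconds, freeing the caller
// (and the thread, connection, and memory it holds) rather than letting
// it wait indefinitely on a faulted downstream system.
var timeout = Policy.TimeoutAsync(
    TimeSpan.FromSeconds(3),
    TimeoutStrategy.Pessimistic);

try
{
    await timeout.ExecuteAsync(
        async ct => await httpClient.GetAsync("https://example.com/api/values", ct),
        CancellationToken.None);
}
catch (TimeoutRejectedException)
{
    // The caller has walked away; handle or fall back as appropriate.
}
```

Passing the cancellation token through to the delegate lets co-operative delegates actually stop work when the timeout fires, rather than merely being abandoned.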
Excessive load is one of the main causes of system instability and failure. Building resilience into systems therefore involves explicitly managing that load and/or proactively scaling to support it.
Excessive load can either be due to:
- Genuine external demand, for example, spikes in user traffic
- Faulting scenarios, in which large numbers of calls back up
Bulkhead policies promote stability by directly managing load, and thus resource consumption. The Polly Bulkhead limits the parallelism of calls placed through it, with options to queue and/or reject excess calls.
- Bulkhead as isolation : Enforcing a bulkhead policy – a parallelism limit – around one stream of calls limits the resources that stream can consume. If that stream faults, it cannot consume all the resources in the host, and so cannot bring down the whole host.
- Bulkhead as load segregation : More than one Bulkhead policy can also be enforced in the same process to accomplish load segregation and relative resource allocation.
A good comparison is the check-out lanes of a supermarket, where there are often distinct lanes for 'baskets only' as opposed to full shopping carts. This segregation allows basket-only shoppers always to check out quickly; otherwise, they could be blocked waiting behind an excess of full shopping carts.
Configuring multiple bulkhead policies to separate software operations delivers similar advantages: relative resource allocation for different call streams, and an assurance that one kind of call cannot entirely block another.
- Bulkhead as load-shedding : Bulkhead policies can also be configured to proactively reject calls beyond a certain limit.
Why actively reject calls when the host might still have capacity to service them?
The answer is that it depends on whether you prefer managed or unmanaged failure. Setting explicit limits enables your systems to fail in predictable, testable ways. Ignoring the server's capacity does not mean there is no limit; it just means you do not know where the limit lies, and your system remains liable to unpredictable, unexpected failures.
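The isolation and load-shedding behaviours above can be sketched with a single Polly bulkhead (v7-style syntax); the parallelism and queue limits are illustrative assumptions, and `CallDownstreamServiceAsync` is a hypothetical placeholder for your own call:

```csharp
using System.Threading.Tasks;
using Polly;
using Polly.Bulkhead;

// At most 10 calls execute in parallel; up to 20 more may wait in the queue.
// A 31st concurrent call is rejected immediately (load-shedding) rather
// than consuming resources on work the host cannot service in good time.
var bulkhead = Policy.BulkheadAsync(
    maxParallelization: 10,
    maxQueuingActions: 20);

try
{
    await bulkhead.ExecuteAsync(() => CallDownstreamServiceAsync());
}
catch (BulkheadRejectedException)
{
    // Shed load predictably: fail fast, log, or divert to a fallback.
}
```

To achieve the "baskets only" segregation described above, you would create one bulkhead instance per call stream, sizing each limit according to the relative resource allocation you want.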
Serving requests from a cache lessens network traffic and overall call duration, which in turn increases resilience and improves the user experience.
Caching can be configured with multiple caches – in-memory/local caches and distributed caches – in combination. The Polly CachePolicy supports multiple caches in the same call, using PolicyWrap.
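A cache-policy sketch (v7-style syntax, using the Polly.Caching.Memory provider package) might look like the following; the 5-minute TTL and cache key are illustrative assumptions, and `distributedCachePolicy` and `FetchValueAsync` are hypothetical placeholders for a second cache policy and your own data fetch:

```csharp
using System;
using Microsoft.Extensions.Caching.Memory;
using Polly;
using Polly.Caching.Memory;

// A local in-memory cache with a 5-minute time-to-live.
var memoryCacheProvider = new MemoryCacheProvider(
    new MemoryCache(new MemoryCacheOptions()));
var localCache = Policy.CacheAsync(
    memoryCacheProvider, TimeSpan.FromMinutes(5));

// PolicyWrap lets cache policies nest: the local cache is consulted first,
// falling through to the (hypothetical) distributed-cache policy on a miss.
var cacheStrategy = Policy.WrapAsync(localCache, distributedCachePolicy);

// The Context's operation key is used as the cache key.
var value = await cacheStrategy.ExecuteAsync(
    context => FetchValueAsync(),
    new Context("my-cache-key"));
```

On a local-cache hit the underlying fetch never runs at all, which is exactly the reduction in network traffic and call duration described above.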
Conclusion: This post walked through the configurable resilience policies offered by the Polly library. Neova has the expertise to implement them in business applications to make those applications fault-resilient.
Note: For code-level details, you can visit the GitHub community repositories demonstrating all of these configurations in detail.