In a distributed system, calls to other services can fail because of timeouts, slow network connections, or overloaded resources. Most of these faults are transient and correct themselves after a short while. Cloud services should be designed to handle such events, and the retry pattern is one way to do so.
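As a rough illustration of the retry pattern, the sketch below retries a failing call with exponential backoff. The function name `retry` and its parameters (`attempts`, `base_delay`, `sleep`) are assumptions made for this example, not part of any particular library.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempts run out.

    `sleep` is injectable so the backoff can be observed without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

This works well for short-lived faults, because the operation is likely to succeed on a later attempt. It does nothing to protect the caller when the fault is not short-lived.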
In simple terms, a circuit breaker’s main function is to interrupt the flow of current after a fault has been detected. What does that mean in technical terms?
When an external system or process stops working, a circuit breaker prevents that failure from bringing down the entire system. It detects failures and wraps the logic that stops a failure from constantly recurring during maintenance, temporary external system failures, or unexpected system difficulties.
However, there are scenarios where the entire service is down and the turnaround time (TAT) for recovery is long. In such cases, continuously retrying an operation that will not succeed is pointless. Instead, the caller service should handle the error on the assumption that the callee service will be down for an extended period.
Additionally, one part of the system could be configured to send error messages on timeouts, i.e. to reply with an error once the timeout period has elapsed. The problem is that concurrent calls for the same operation each have to wait until the timeout expires. The waiting calls hold on to resources such as threads, which can be fatal to the entire system. Setting a shorter timeout does not solve the problem either: if the service legitimately takes longer than the timeout to respond, the call will fail every time.
Fault Tolerance and Resiliency: Introduction to the Circuit Breaker Pattern.
Cascading system failures can be prevented with the circuit breaker pattern. The cascading effect may itself have been introduced by a retry pattern that was added to improve the system’s overall resilience.
Circuit breakers help the system avoid the problems described above. A circuit breaker prevents an application from repeatedly calling callee services that are likely to fail. It also detects when those services are back up and then resumes invoking their operations.
A circuit breaker acts as a proxy for operations that tend to fail. Based on the failure rate, the proxy decides whether subsequent calls are forwarded to the service or rejected with the configured exception or message.
The operation to be performed is wrapped in a circuit breaker object, which monitors failures. Configured thresholds, such as a timeout, determine whether each call counts as a success or a failure. Once the failures reach the configured limit, the circuit breaker trips and all requests passed to it fail immediately.
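The wrapping described above can be sketched as a small proxy object. This is a minimal illustration, assuming that a failure is any raised exception and that the limit counts consecutive failures; the names `SimpleBreaker`, `CircuitOpenError`, and `failure_threshold` are hypothetical.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class SimpleBreaker:
    """Proxy that wraps an operation and trips after repeated failures."""

    def __init__(self, operation, failure_threshold=3):
        self.operation = operation
        self.failure_threshold = failure_threshold  # assumed: consecutive failures
        self.failures = 0
        self.tripped = False

    def call(self, *args, **kwargs):
        if self.tripped:
            # Tripped: fail at once instead of forwarding to the service.
            raise CircuitOpenError("circuit is open")
        try:
            result = self.operation(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.tripped = True
            raise
        self.failures = 0  # a success resets the consecutive-failure count
        return result
```

This sketch never recovers once tripped. Recovery needs a notion of elapsed time and trial requests on top of it.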
The following points highlight the need for a circuit breaker pattern:
- Custom Fallback: When the circuit trips, we can return data from another source or from cached values.
- Fail Early: We don’t have to wait for the timeout on every request. Once the circuit is open, calls fail early, which avoids delays and blocked threads.
- Avoid crashing the caller service: If many threads are waiting for the other service to respond, unrelated functionality may also crash, bringing the system to a halt.
- Automatic Healing: The circuit breaker periodically checks whether the service is back up.
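The custom fallback and fail-early points above can be illustrated with a short sketch. It assumes the breaker's state is available as a boolean and that a plain dictionary serves as the cache; `fetch_with_fallback` and its parameters are hypothetical names chosen for this example.

```python
def fetch_with_fallback(circuit_open, fetch_remote, cache, key):
    """Return (value, source), preferring live data over the cached value."""
    if circuit_open:
        # Fail early: skip the remote call so no thread waits for a timeout.
        return cache.get(key), "cache"
    try:
        value = fetch_remote(key)
    except Exception:
        # Custom fallback: serve the last known value (None if never cached).
        return cache.get(key), "cache"
    cache[key] = value  # refresh the fallback for a later circuit trip
    return value, "remote"
```

Whether stale cached data is acceptable depends on the operation, so a fallback like this is a per-call design decision rather than a default.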
The Different States of a Circuit Breaker
- Closed State: This is the normal operating state. Calls pass through to the services, and the failures from the services being called stay below the threshold.
- Open State: Once failures cross the configured threshold, the circuit breaker trips and moves to the open state.
- Half Open: We also need a way to confirm that the services are back up so that operations can proceed. After a configured timeout (the sleep window), the circuit breaker passes some requests through to check whether the services are up and running. If the requests succeed, the circuit breaker resets and moves to the closed state. If they fail, the circuit breaker returns to the open state until the next configured sleep window has elapsed.
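The three states and the moves between them can be written down as a small transition table. This is only a sketch of the state logic, with event names (`THRESHOLD_CROSSED`, `SLEEP_WINDOW_ELAPSED`, `TRIAL_SUCCEEDED`, `TRIAL_FAILED`) invented for this example.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class Event(Enum):
    THRESHOLD_CROSSED = "threshold crossed"
    SLEEP_WINDOW_ELAPSED = "sleep window elapsed"
    TRIAL_SUCCEEDED = "trial succeeded"
    TRIAL_FAILED = "trial failed"


# Only these four moves change the state; every other pair leaves it unchanged.
TRANSITIONS = {
    (State.CLOSED, Event.THRESHOLD_CROSSED): State.OPEN,
    (State.OPEN, Event.SLEEP_WINDOW_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.TRIAL_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.TRIAL_FAILED): State.OPEN,
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to see that the only route from open back to closed goes through half-open.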
An example to explain the above states.
A system is configured to call a service whose expected response time is 100-200 ms. We have configured the circuit breaker to trip once 75% of the requests in a 10-minute window exceed this threshold, and the sleep window is 20 seconds. So if 100 calls are made and 80 of them take more than 200 ms, the circuit breaker trips and allows no further requests to the service. After 20 seconds, i.e. the configured sleep window, the circuit breaker calls the service again to check whether the requests succeed within the configured response time. If they do, the circuit breaker moves back to the closed state. Otherwise it remains in the open state and tries again after the next sleep window of 20 seconds.
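The arithmetic in this example can be checked with a few lines of code. The constants come from the example above; the function names and the choice to treat "75% of requests" as "at least 75%" are assumptions made for this sketch, and the 10-minute window is represented simply by the list of durations passed in.

```python
SLOW_CALL_MS = 200    # upper bound of the expected response time
TRIP_RATE = 0.75      # share of slow calls that trips the breaker
SLEEP_WINDOW_S = 20   # seconds to wait before probing the service again


def slow_call_rate(durations_ms, slow_ms=SLOW_CALL_MS):
    """Fraction of calls in the window that took longer than `slow_ms`."""
    if not durations_ms:
        return 0.0
    slow = sum(1 for d in durations_ms if d > slow_ms)
    return slow / len(durations_ms)


def should_trip(durations_ms):
    """True when the slow-call rate reaches the configured trip rate."""
    return slow_call_rate(durations_ms) >= TRIP_RATE


def may_probe(opened_at_s, now_s):
    """True once the sleep window has elapsed since the breaker opened."""
    return now_s - opened_at_s >= SLEEP_WINDOW_S
```

With 80 of 100 calls above 200 ms, the slow-call rate is 0.8, which is above 0.75, so `should_trip` returns `True`. A breaker that opened at second 0 may probe the service again from second 20 onward.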
The time-series table below shows how the caller and callee services interact at the given average failure percentages. It illustrates a more sophisticated implementation of the circuit breaker pattern, in which the system returns to the closed state only after n consecutive successful checks (n = 5 in this example). Fewer than n consecutive successful checks keeps the system in the half-open state.
Elapsed Time (Min) | 0-1 | 1-2 | 2-3 | 3-4 | 4-5 | 5-6 | 6-7 | 7-8 | 8-9 | 9-10 | 10-11 | 11-12 | 12-13 | 13-14 | 14-15 | 15-16 | 16-17 |
Avg. Failures % | 0 | 50 | 77 | 80 | 80 | 81 | 81 | 82 | 83 | 88 | 70 | 75 | 70 | 70 | 60 | 60 | 60 |
State | Closed | Closed | Closed | Closed | Closed | Closed | Open | Open | Open | Open | Open | Open | Half-Open | Half-Open | Half-Open | Half-Open | Closed |
Callee Service | Called | Called | Called | Called | Called | Called | Not Called | Not Called | Not Called | Not Called | Not Called | Not Called | Not Called | Not Called | Not Called | Not Called | Called |
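The rule of closing only after n consecutive successful checks can be sketched as a small counter. The class name `HalfOpenGate` is hypothetical, and the choice to reset the streak on a failed check while staying half-open is an assumption; an implementation could equally move back to the open state at that point.

```python
class HalfOpenGate:
    """Tracks health checks while half-open; closes after n successes in a row."""

    def __init__(self, required_successes=5):
        self.required_successes = required_successes
        self.streak = 0
        self.state = "half-open"

    def record_check(self, succeeded):
        """Record one health check and return the resulting state."""
        if self.state == "closed":
            return self.state
        # A failed check breaks the streak (assumed: the gate stays half-open).
        self.streak = self.streak + 1 if succeeded else 0
        if self.streak >= self.required_successes:
            self.state = "closed"
        return self.state
```

With `required_successes=5`, four successful checks leave the gate half-open, and the fifth consecutive success closes it.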
The circuit breaker design pattern is used in both monolithic and microservice-based deployments. It prevents the system from sending unnecessary load to a failed callee service, and it gives the backend service time to recover from errors.