Two years have passed since my first blog post about Retries with Resilience4j, where I promised a second post about Circuit Breakers. Here it is!
In our previous blog post, we added retries to our gateway application. The gateway will retry failing requests to one of the product microservices up to three times and provide a fallback value if the third retry isn’t successful.

Adding the retries increased the overall reliability of the gateway. For example, users still get hot deals if there’s a minor hiccup and the service responds on a second try. Thanks to the fallback value, users can get products from the other categories if the hot-deals microservice does not respond successfully during the retries.
However, this can lead to a follow-up problem. Imagine you are feeling unwell and want to stay in bed for the day. Now all your colleagues keep asking you, the whole day long: “Are you feeling better?”, “Are you feeling better?”, “Are you…?” Not the best conditions for recovery.
The same thing can happen to a microservice that is having trouble. It could simply be restarting. Due to the retries, the microservice receives far more requests, at least three times as many. And you know your users: quickly hitting F5 is still their favorite way to recover from any problem. So the load can increase heavily in exactly those turbulent situations, reinforcing the problems. Martin Fowler calls it a .
This is where the concept of a circuit breaker comes in. I couldn’t explain it better than Martin Fowler did in his article:
“The basic idea behind the circuit breaker is straightforward. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error without the protected call being made at all.”
A circuit breaker has three states:
CLOSED: Everything is working as expected; no problems, and calls to the protected remote system are made.
OPEN: The configured error threshold has been reached. No calls to the protected remote system are made until a given duration has elapsed.
HALF_OPEN: A few requests to the remote system are allowed, to test whether the system is responding again and whether the circuit breaker can be closed.
You can look at the documentation of the circuit breaker implementation in Resilience4j to better understand how a circuit breaker’s behavior can be fine-tuned.
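The three states can be sketched as a tiny state machine. The following is an illustrative plain-Java model, not Resilience4j’s implementation: it uses a simple consecutive-failure count instead of the library’s sliding window, and the class and method names are invented for this sketch.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative circuit breaker state machine. Resilience4j's real one is far
// more configurable (sliding windows, failure-rate thresholds, etc.).
class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    // CLOSED and HALF_OPEN allow calls. OPEN blocks them until the wait
    // duration has elapsed, then moves to HALF_OPEN to probe the backend.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (!Instant.now().isBefore(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;
                return true;
            }
            return false;
        }
        return true;
    }

    public synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful probe closes the breaker again
    }

    public synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip: stop calling the failing backend
            openedAt = Instant.now();
        }
    }

    public synchronized State state() { return state; }
}
```

In the real library, the transition thresholds and wait duration are exactly what the configuration properties shown later in this post control.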
To add a circuit breaker, we use another Resilience4j annotation, @CircuitBreaker, next to the @Retry annotation from the previous blog post.
@Retry(name = "fashion", fallbackMethod = "getProductsFallbackRetry")
@CircuitBreaker(name = "fashion", fallbackMethod = "getProductsFallbackCircuitBreaker")
public List<Product> getFashionWithRetryAndCircuitBreaker() {
    …
}
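The annotations above reference two fallback methods by name, whose bodies the post elides. Here is a hypothetical sketch of what they could look like; the Product type and the empty-list behavior are assumptions. In Resilience4j, a fallback method must live in the same class, have the same return type as the protected method, and may accept the exception as a parameter.

```java
import java.util.Collections;
import java.util.List;

class FashionService {
    // Hypothetical Product type standing in for the real domain class.
    record Product(String name) {}

    // Fallback used by @Retry once all retry attempts have failed: return an
    // empty list so the gateway can still answer with the other categories.
    public List<Product> getProductsFallbackRetry(Exception e) {
        return Collections.emptyList();
    }

    // Fallback used by @CircuitBreaker while the breaker is OPEN.
    public List<Product> getProductsFallbackCircuitBreaker(Exception e) {
        return Collections.emptyList();
    }
}
```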
Next, we add some configuration as we don’t want to use the defaults for the circuit breaker in our case:
resilience4j:
  circuitbreaker:
    instances:
      fashion:
        registerHealthIndicator: true
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 100
        minimumNumberOfCalls: 10
        waitDurationInOpenState: 10000
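To see what these numbers mean: with slidingWindowType: COUNT_BASED, the breaker evaluates the failure rate over the last slidingWindowSize calls, but only once minimumNumberOfCalls outcomes have been recorded. A plain-Java sketch of that decision (illustrative only; the class is invented, and the 50% threshold mirrors Resilience4j’s default failureRateThreshold):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative COUNT_BASED sliding window: remember the outcome of the last
// `windowSize` calls and compute the failure rate once enough calls exist.
class SlidingWindow {
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private final int windowSize;
    private final int minimumNumberOfCalls;
    private final float failureRateThreshold; // percentage, e.g. 50f

    public SlidingWindow(int windowSize, int minimumNumberOfCalls, float failureRateThreshold) {
        this.windowSize = windowSize;
        this.minimumNumberOfCalls = minimumNumberOfCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    public void record(boolean failure) {
        outcomes.addLast(failure);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst(); // evict the oldest outcome
        }
    }

    // The breaker only considers tripping once minimumNumberOfCalls
    // outcomes have been recorded.
    public boolean shouldTrip() {
        if (outcomes.size() < minimumNumberOfCalls) {
            return false;
        }
        long failures = outcomes.stream().filter(f -> f).count();
        return 100f * failures / outcomes.size() >= failureRateThreshold;
    }
}
```

With the configuration above, the breaker therefore stays CLOSED for the first nine calls no matter what, which is why minimumNumberOfCalls matters for short-lived load spikes.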
Same story as in the previous blog post: you could write an integration test, for example using @SpringBootTest. An example for circuit breakers can be found . But again, wouldn’t it be nice to see the effects in your real-world environment?
Let’s use 黑料社 to take a closer look and implement a nice experiment.
As good chaos engineers, we always start with a hypothesis for an experiment. What would we expect?
The circuit breaker is CLOSED before the attack.
It is OPEN after the attack.
While it is OPEN, the circuit breaker fallback is used.
It is CLOSED again once the attack is over.
We exposed the circuit breaker state via registerHealthIndicator: true and use HTTP checks to verify it. The experiment steps:
1. Check if the state is CLOSED
2. Start an attack targeting the hot-deals microservice for 15 seconds
3. Check if the state is OPEN
4. Wait 20 seconds
5. Check if the state is back to CLOSED
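Steps 1, 3, and 5 boil down to reading the circuit breaker state from the Spring Boot health endpoint that registerHealthIndicator: true feeds. Here is a minimal sketch of the parsing side, assuming the usual /actuator/health JSON shape; the sample payload and the regex-based extraction are illustrative only, and a real check would first fetch the JSON over HTTP (e.g. with java.net.http.HttpClient).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the HTTP check: extract the circuit breaker state from the
// health endpoint's JSON response. Parsing with a regex keeps the sketch
// dependency-free; a real check would use a JSON library.
class HealthCheck {
    private static final Pattern STATE = Pattern.compile("\"state\"\\s*:\\s*\"(\\w+)\"");

    public static String circuitBreakerState(String json) {
        Matcher m = STATE.matcher(json);
        return m.find() ? m.group(1) : "UNKNOWN";
    }
}
```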

Great! The experiment ran successfully without any errors. The three checks for the states of the circuit breaker succeeded. All requests to the gateway endpoint have been answered within a reasonable time.
But wait 鈥 Didn’t I expect the “stepped” response times as we had with the retries? At least as long as the circuit breaker is in OPEN state. We need help seeing steps in our response time. I expected that the @Retry would be handled first, and after that, the @CircuitBreaker would add its magic. A closer look into the confirms what our experiment was showing.
The Resilience4j Aspects order is the following:
Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( Function ) ) ) ) )
This way, the Retry is applied at the end (if needed).
So the CircuitBreaker returns its fallback, and the Retry is no longer used with our configuration. We need to change the circuitBreakerAspectOrder and the retryAspectOrder to have the Retry applied before the CircuitBreaker.
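Resilience4j’s Spring Boot integration exposes these aspect orders as configuration properties (circuitBreakerAspectOrder and retryAspectOrder). A sketch of such an override; the concrete values here are illustrative, so check the Resilience4j documentation for the ordering semantics before relying on them:

```yaml
resilience4j:
  circuitbreaker:
    circuitBreakerAspectOrder: 1
  retry:
    retryAspectOrder: 2
```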
Conducting chaos experiments yields a wealth of insights about your system, and beyond improving reliability, these discoveries are a key benefit of running them. There is always something new to find, as software often behaves differently than we expect. In complex systems, testing the integrated whole is indispensable.
If you want to get started with 黑料社 and run your own experiments, you can try it yourself or request a demo.