The Circuit Breaker Pattern in Microservices - 18/11/2024
The Circuit Breaker Pattern: Stop Failing Requests From Taking Down Your System
You’ve got Service A calling Service B. Service B starts timing out. Now Service A is holding connections open, waiting. Its thread pool fills up. Requests to Service A start failing too—even the ones that don’t touch Service B. Congratulations, you’ve got a cascading failure.
The circuit breaker pattern exists to stop this.
The Core Idea
A circuit breaker wraps your external calls and tracks failures. When failures cross a threshold, it “trips” and starts failing requests immediately—no waiting for timeouts, no wasting resources on calls that won’t succeed.
It has three states:
Closed — Normal operation. Requests go through. The breaker counts failures.
Open — Too many failures. Requests fail instantly without attempting the call. After a timeout, it moves to half-open.
Half-Open — Trial mode. A few requests are allowed through. If they succeed, back to closed. If they fail, back to open.
That’s it. The rest is tuning.
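To make the three states concrete, here is a minimal sketch in plain Python with no library. It simplifies in two ways that are assumptions of this example, not part of the pattern: it trips on a count of consecutive failures rather than a percentage, and it allows a single trial call in half-open. The names `SimpleBreaker` and `CircuitOpenError` are made up for illustration, and the class is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    """Illustrative three-state breaker; not thread-safe."""

    def __init__(self, fail_max: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_max = fail_max            # consecutive failures that trip the breaker
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock                  # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")  # fail fast, no call made
            self.state = "half-open"        # break duration elapsed: allow a trial
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial reopens at once; otherwise wait for the threshold
        if self.state == "half-open" or self.failures >= self.fail_max:
            self.state = "open"
            self.opened_at = self.clock()
```

Passing the clock in as a parameter is a design choice for testability: a test can advance time by hand instead of sleeping through the reset timeout.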
Implementation
Here’s a circuit breaker in TypeScript using opossum:
import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(
  async (url: string) => {
    const res = await fetch(url);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  },
  {
    errorThresholdPercentage: 50, // % of failures to trip
    resetTimeout: 30_000,         // ms before trying half-open
    timeout: 5_000,               // ms before a call is considered failed
    volumeThreshold: 20,          // min requests before threshold applies
  }
);

breaker.on('open', () => console.log('Circuit opened'));
breaker.on('halfOpen', () => console.log('Circuit half-open, testing...'));
breaker.on('close', () => console.log('Circuit closed'));

// Use it
const data = await breaker.fire('https://api.example.com/data');
And in Python with pybreaker:
import pybreaker
import requests

breaker = pybreaker.CircuitBreaker(
    fail_max=5,        # consecutive failures before opening
    reset_timeout=30,  # seconds before half-open
    # The base listener is a no-op; subclass it to hook events
    listeners=[pybreaker.CircuitBreakerListener()],
)

@breaker
def call_api(url: str) -> dict:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

# Use it
try:
    data = call_api("https://api.example.com/data")
except pybreaker.CircuitBreakerError:
    # Circuit is open, use fallback
    data = get_cached_response()  # your own cache lookup (not shown)
Picking Thresholds
This is where most people screw up. Here’s a starting point:
| Parameter | Starting Value | Adjust When… |
|---|---|---|
| Failure threshold | 50% | Lower if failures are expensive, higher if service is flaky |
| Sampling window | 10-30 seconds | Shorter for fast services, longer for batch operations |
| Minimum throughput | 20 requests | Higher for high-traffic services to avoid noise |
| Break duration | 30 seconds | Match to typical recovery time of downstream service |
| Half-open test calls | 3-5 | More if you need higher confidence before closing |
The minimum throughput matters more than people think. Without it, two failures out of three requests (a 67% failure rate) trip the breaker, even though three requests are not a statistically meaningful sample.
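The interaction between the sampling window, the failure threshold, and the minimum throughput can be sketched as a small sliding-window counter. This is an illustration only: `FailureRateWindow` and its parameter names are invented here, timestamps are supplied by the caller in increasing order, and real libraries typically use bucketed counters instead of storing every call.

```python
from collections import deque
from typing import Deque, Tuple


class FailureRateWindow:
    """Tracks call outcomes over a sliding time window."""

    def __init__(self, window_seconds: float = 10.0,
                 failure_threshold: float = 0.5,
                 min_throughput: int = 20) -> None:
        self.window_seconds = window_seconds
        self.failure_threshold = failure_threshold  # fraction: 0.5 means 50%
        self.min_throughput = min_throughput        # calls needed before the rate counts
        self._calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, timestamp: float, succeeded: bool) -> None:
        self._calls.append((timestamp, succeeded))

    def should_trip(self, now: float) -> bool:
        # Drop outcomes that have aged out of the window
        while self._calls and now - self._calls[0][0] > self.window_seconds:
            self._calls.popleft()
        total = len(self._calls)
        if total < self.min_throughput:
            return False  # too few samples to trust the failure rate
        failures = sum(1 for _, ok in self._calls if not ok)
        return failures / total >= self.failure_threshold


window = FailureRateWindow(min_throughput=20)
for ok in (False, False, True):
    window.record(timestamp=0.0, succeeded=ok)
window.should_trip(now=1.0)  # False: two of three failed, but only 3 calls were seen
```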
What To Do When the Circuit Opens
Failing fast is only half the job. You need a fallback strategy:
Return cached data — If you cached the last good response, serve that. Stale data beats no data for many use cases.
Degrade gracefully — Can’t reach the recommendation service? Show popular items instead of personalized ones.
Queue for retry — For non-time-sensitive operations, queue the request and process it when the circuit closes.
Return a sensible default — Sometimes a hardcoded fallback is fine. Can’t reach the feature flag service? Default to the safe option.
Just fail — Sometimes there’s no good fallback. That’s okay. Fail fast with a clear error rather than timing out.
const breaker = new CircuitBreaker(fetchUserRecommendations, {
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
});
// Runs whenever fire() fails: open circuit, timeout, or thrown error
breaker.fallback(() => getPopularItems());
const recommendations = await breaker.fire(userId);
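The "return cached data" strategy from the list above could look roughly like this in Python. It is a library-free sketch: `LastGoodCache` is a hypothetical helper, and `fn` is assumed to be the breaker-protected call (for example a lambda around the `call_api` function from the pybreaker example). It has no expiry, so in practice you would probably also bound how stale a served value may be.

```python
from typing import Any, Callable, Dict, Hashable


class LastGoodCache:
    """Serves the last successful response when the protected call fails."""

    def __init__(self) -> None:
        self._last_good: Dict[Hashable, Any] = {}

    def call(self, key: Hashable, fn: Callable[[], Any]) -> Any:
        try:
            result = fn()
        except Exception:
            if key in self._last_good:
                return self._last_good[key]  # stale, but better than nothing
            raise  # nothing cached yet: fail fast with the original error
        self._last_good[key] = result        # remember the latest good response
        return result
```

Keying the cache per request (for example by URL or user ID) keeps one caller's stale data from being served to another.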
When Circuit Breakers Go Wrong
Threshold too sensitive. A brief network blip trips the breaker and you’re rejecting good requests for 30 seconds. Fix: raise minimum throughput, lengthen sampling window.
Threshold too lax. By the time the breaker trips, the damage is done—your thread pools are already exhausted. Fix: lower the failure threshold, add timeout handling.
Break duration mismatch. Your downstream service takes 5 minutes to recover but your breaker retries every 30 seconds, hammering it with test requests. Fix: use exponential backoff for break duration or monitor downstream health directly.
Testing in half-open kills recovery. You allow 3 test requests in half-open. The downstream service can handle 1 request/second during recovery. Your 3 concurrent test requests overwhelm it, it fails, circuit opens again. Fix: space out half-open test requests.
Circuit breaker per instance vs shared. If each instance of your service has its own breaker, they’ll all independently hammer the recovering downstream service. Consider sharing circuit state via Redis or a service mesh.
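For the break duration mismatch, exponential backoff can be as small as one function. The sketch below assumes you track how many times in a row the circuit has reopened without a successful close; `break_duration` and its defaults (30 second base, doubling, 10 minute cap) are illustrative values, not recommendations.

```python
def break_duration(consecutive_opens: int, base: float = 30.0,
                   factor: float = 2.0, cap: float = 600.0) -> float:
    """Seconds to stay open after the Nth consecutive trip (1-based)."""
    if consecutive_opens < 1:
        raise ValueError("consecutive_opens must be >= 1")
    # Grow geometrically, but never wait longer than the cap
    return min(cap, base * factor ** (consecutive_opens - 1))


[break_duration(n) for n in range(1, 7)]
# [30.0, 60.0, 120.0, 240.0, 480.0, 600.0]
```

Reset the counter to zero once the circuit closes, so a later unrelated outage starts again from the base duration.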
Circuit Breakers vs. Other Patterns
Retries — Try again on failure. Useful for transient errors. Dangerous without a circuit breaker because you’ll retry forever against a dead service.
Timeouts — Stop waiting after X seconds. Essential, but you’re still consuming resources while waiting. Circuit breakers prevent the wait entirely.
Bulkheads — Isolate resources per dependency. If Service B is slow, it only exhausts its dedicated thread pool, not your entire application. Complementary to circuit breakers.
Rate limiting — Control how many requests you send. Protects the downstream service from you. Circuit breakers protect you from the downstream service.
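Of these, the bulkhead is the least familiar to many people, so here is a rough sketch of one built on a semaphore. `Bulkhead` and `BulkheadFullError` are names invented for this example. It rejects immediately when all slots are busy instead of queueing, which is one possible policy among several.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the dependency's concurrency limit is already in use."""


class Bulkhead:
    """Caps how many calls to one dependency may run at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: reject right away instead of queueing
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return fn()
        finally:
            self._slots.release()  # free the slot whether fn succeeded or raised
```

Create one `Bulkhead` per downstream dependency so that a slow Service B can only ever occupy its own slots.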
Use them together:
Retry → Circuit Breaker → Timeout → Bulkhead → Actual Call
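The ordering matters most at the outer edge: the retry wraps the breaker, so every attempt passes through it, and the retry should give up as soon as the breaker rejects. A minimal sketch of that outer layer follows. `with_retry` is a hypothetical helper, `CircuitOpenError` stands in for whatever your breaker library raises when open (for pybreaker that would be `pybreaker.CircuitBreakerError`), and `fn` is assumed to be the breaker-protected call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Stand-in for the breaker's 'circuit is open' rejection."""


def with_retry(fn: Callable[[], T], attempts: int = 3, delay: float = 0.1,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures, but stop at once if the circuit is open."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay * 2 ** (attempt - 1))  # exponential backoff between tries
    raise AssertionError("unreachable")
```

Because the attempts are bounded and open-circuit rejections are never retried, a dead downstream service costs each caller at most `attempts` real calls before the breaker takes over.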
Should You Even Use One?
Circuit breakers add complexity. You need monitoring to see their state. You need fallbacks for when they’re open. You need to tune thresholds. You need to test failure scenarios.
Use a circuit breaker when:
- The downstream service is out of your control
- Failures in that service could cascade to your service
- You have meaningful fallback behavior
- The service has enough traffic to make statistical thresholds meaningful
Skip it when:
- You’re calling something local and fast
- There’s no fallback—if Service B is down, you’re down anyway
- Traffic is too low to get meaningful failure rates
- You’re already using a service mesh that handles this (Istio, Linkerd)
Integrating Without Wrapping Every Call
Nobody wants to wrap every HTTP call manually. Here are three approaches to integrate circuit breakers cleanly.
1. Wrap Your HTTP Client
Create a wrapper around your HTTP client that applies the circuit breaker automatically:
import CircuitBreaker from 'opossum';
import axios, {AxiosRequestConfig, AxiosResponse} from 'axios';

// One breaker per origin, created on first use
const breakers = new Map<string, CircuitBreaker>();

function getBreaker(baseURL: string): CircuitBreaker {
  if (!breakers.has(baseURL)) {
    const breaker = new CircuitBreaker(
      (config: AxiosRequestConfig) => axios(config),
      {errorThresholdPercentage: 50, resetTimeout: 30_000, timeout: 5_000}
    );
    breakers.set(baseURL, breaker);
  }
  return breakers.get(baseURL)!;
}

export const http = {
  async get<T>(url: string, config?: AxiosRequestConfig): Promise<T> {
    const base = new URL(url).origin;
    const res = await getBreaker(base).fire({...config, method: 'GET', url});
    return (res as AxiosResponse<T>).data;
  },
  async post<T>(url: string, data?: unknown, config?: AxiosRequestConfig): Promise<T> {
    const base = new URL(url).origin;
    const res = await getBreaker(base).fire({...config, method: 'POST', url, data});
    return (res as AxiosResponse<T>).data;
  },
  // ... put, delete, etc.
};

// Usage: no wrapping needed
const user = await http.get<User>('https://api.example.com/users/1');
This gives you one circuit breaker per origin, created lazily. All calls through http are protected automatically.
2. Axios Instances
If you’re already using axios instances, route each instance’s requests through its own circuit breaker:
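import CircuitBreaker from 'opossum';
import axios, {AxiosRequestConfig, AxiosResponse} from 'axios';

export function createClient(baseURL: string) {
  const client = axios.create({baseURL, timeout: 5_000});
  // The breaker calls the configured instance, so baseURL and timeout apply
  const breaker = new CircuitBreaker(
    (config: AxiosRequestConfig) => client.request(config),
    {errorThresholdPercentage: 50, resetTimeout: 30_000}
  );
  const request = <T>(config: AxiosRequestConfig) =>
    breaker.fire(config) as Promise<AxiosResponse<T>>;
  // axios binds get/post/etc. to an internal context, so reassigning
  // client.request would not reroute them; expose wrapped methods instead
  return {
    request,
    get: <T>(url: string, config?: AxiosRequestConfig) =>
      request<T>({...config, method: 'GET', url}),
    post: <T>(url: string, data?: unknown, config?: AxiosRequestConfig) =>
      request<T>({...config, method: 'POST', url, data}),
    // ... put, delete, etc.
  };
}

// Usage
const paymentApi = createClient('https://payments.example.com');
const userApi = createClient('https://users.example.com');

// These are now protected
await paymentApi.post('/charge', {amount: 100});
await userApi.get('/users/1');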
3. Service Classes with Dependency Injection
For larger projects, encapsulate each external service in a class:
import CircuitBreaker from 'opossum';

interface PaymentResult {
  id: string;
  status: string;
}

class PaymentService {
  // Typed so that fire() resolves to PaymentResult rather than unknown
  private breaker: CircuitBreaker<[string, RequestInit?], PaymentResult>;

  constructor(private baseUrl: string) {
    this.breaker = new CircuitBreaker(
      (path: string, init?: RequestInit): Promise<PaymentResult> =>
        fetch(`${this.baseUrl}${path}`, init).then(r => {
          if (!r.ok) throw new Error(`HTTP ${r.status}`);
          return r.json();
        }),
      {errorThresholdPercentage: 50, resetTimeout: 30_000}
    );
    this.breaker.fallback(() => ({id: '', status: 'pending_retry'}));
  }

  charge(amount: number): Promise<PaymentResult> {
    return this.breaker.fire('/charge', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({amount}),
    });
  }

  refund(paymentId: string): Promise<PaymentResult> {
    return this.breaker.fire(`/refund/${paymentId}`, {method: 'POST'});
  }
}

// Wire it up in your DI container
const paymentService = new PaymentService('https://payments.example.com');

// Usage: circuit breaker is invisible to callers
await paymentService.charge(100);
This is the cleanest approach for complex projects. Each service owns its circuit breaker config, fallbacks are defined in one place, and calling code doesn’t know or care about the circuit breaker.
Observability
A circuit breaker you can’t see is useless. At minimum, track:
- Current state (closed/open/half-open)
- State transition events
- Failure rate over time
- Requests rejected due to open circuit
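// opossum emits one event per state rather than a single state-change event
breaker.on('open', () => metrics.recordStateChange('open'));
breaker.on('halfOpen', () => metrics.recordStateChange('halfOpen'));
breaker.on('close', () => metrics.recordStateChange('closed'));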
Or with pybreaker:
class MetricsListener(pybreaker.CircuitBreakerListener):
    def state_change(self, breaker, old, new):
        metrics.record_state_change(old, new)
Set up alerts for state transitions. If your circuit to the payment service opens at 3am, you want to know.
Summary
Circuit breakers prevent cascading failures by failing fast when a downstream service is unhealthy. The pattern is simple—three states, threshold-based transitions—but the tuning is where the real work happens.
Start with conservative thresholds, monitor aggressively, and adjust based on real behavior. And always have a plan for what happens when the circuit opens.