Reactive error recovery is not a single operator; it is a layered strategy. The canonical three-layer stack is a timeout to bound how long you wait, a retry to attempt recovery, and a circuit breaker to stop attempting when the target is clearly unavailable. Each layer serves a distinct purpose, and omitting any one of them exposes a different failure mode.

Exponential backoff (waiting 1s, 2s, 4s, 8s between retries) reduces the load that retries place on a struggling service. On its own it does not fully prevent the thundering herd, the scenario where every client in a fleet retries simultaneously after a service blip and produces a second outage from retry traffic alone, because clients that failed together still back off on the same schedule. The Amazon Builders' Library formalizes the fix: cap the maximum backoff interval, add jitter (a randomized per-client delay), and treat retries as "selfish" requests that must not compound the victim's problems. The same guidance derives timeout thresholds from a high latency percentile such as p99.9 rather than from the mean or median.

In RxJS, `retryWhen` (deprecated in RxJS 7 in favor of `retry({ delay })`) accepts a notifier function that receives the stream of errors and returns an Observable controlling retry timing. Piping that error stream through `delayWhen((_error, attempt) => timer(Math.min(30000, 1000 * 2 ** attempt)))` implements capped exponential backoff declaratively; `attempt` is the zero-based index that `delayWhen` passes as its second argument. Reactor's `Flux.retryWhen(Retry.backoff(3, Duration.ofSeconds(1)).maxBackoff(Duration.ofSeconds(30)))` builds the same strategy with the `reactor.util.retry.Retry` utilities that ship in `reactor-core`.

`catchError` (RxJS) and `onErrorResume` (Reactor) provide fallback stream substitution: when the primary stream errors, switch to a cache, a default value, or a degraded response. This matches the Reactive Manifesto's Resilient pillar, in which failures are contained and isolated, with recovery strategies determined per component rather than systemically.

Kotlin Flow's `retry(retries) { cause -> cause is IOException }` applies conditional retry logic: retry only on recoverable error types and let fatal errors propagate. Flow's `retryWhen { cause, attempt -> ... }` covers the backoff side, since its predicate is a suspending function that can call `delay` before returning `true`, so both compose cleanly inside a coroutine scope.
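The capped exponential backoff with jitter described above can be sketched without any reactive library. This is a minimal illustration, not the implementation used by RxJS, Reactor, or Kotlin Flow. Every name in it (`BackoffOptions`, `cappedDelayMs`, `jitteredDelayMs`, `retryWithBackoff`) is hypothetical, and the "full jitter" variant (a uniform random delay between zero and the capped value) is one assumed choice among several.

```typescript
// Hypothetical helper names; nothing here is part of RxJS, Reactor, or Kotlin Flow.
interface BackoffOptions {
  baseMs: number;     // delay before the first retry
  capMs: number;      // upper bound on any single delay
  maxRetries: number; // retries allowed after the initial attempt
}

// Capped exponential delay for a zero-based retry index:
// baseMs * 2^attempt, never larger than capMs.
function cappedDelayMs(attempt: number, opts: BackoffOptions): number {
  return Math.min(opts.capMs, opts.baseMs * 2 ** attempt);
}

// "Full jitter" (an assumed variant): a uniform random delay in [0, cappedDelay).
// `random` is injectable so the function can be exercised deterministically.
function jitteredDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * cappedDelayMs(attempt, opts));
}

// Retry an async operation, sleeping a jittered delay between attempts.
// `isRetryable` lets fatal errors propagate immediately instead of being retried.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  opts: BackoffOptions,
  isRetryable: (error: unknown) => boolean = () => true,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up once the retry budget is spent or the error is not recoverable.
      if (attempt >= opts.maxRetries || !isRetryable(error)) throw error;
      await sleep(jitteredDelayMs(attempt, opts));
    }
  }
}
```

With `baseMs: 1000` and `capMs: 30000`, the un-jittered schedule is 1s, 2s, 4s, 8s, 16s, and then 30s for every later retry. Jitter spreads each client's actual wait across that window so that clients which failed together do not retry together.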
Uber's RAMEN push platform reduced p95 latency by 45% partly through better retry discipline on the gRPC streaming layer—knowing when to retry versus when to fail fast is as important as the retry strategy itself.
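The fail-fast half of that trade-off is what the circuit breaker layer encodes. The sketch below is a deliberately small state machine, assuming a consecutive-failure threshold and a fixed cool-down. The class and function names (`CircuitBreaker`, `callThroughBreaker`) are hypothetical and do not come from any library mentioned above. Production breakers usually track failure rates over a window and limit how many trial calls pass while half-open.

```typescript
// Hypothetical names throughout; a sketch, not a production circuit breaker.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number; // consecutive failures before opening
  private readonly resetAfterMs: number;     // cool-down before a trial call is allowed
  private readonly now: () => number;        // injectable clock

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // Should the caller attempt the request right now?
  canAttempt(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      // Cool-down elapsed: let trial calls through. This simplified version
      // does not limit how many calls pass while half-open.
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    // A failed trial call, or too many failures in a row, (re)opens the circuit.
    if (this.state === "half-open" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}

// Wrap an async call so that an open circuit rejects immediately (fail fast)
// instead of spending a timeout and a retry budget on a target known to be down.
async function callThroughBreaker<T>(
  breaker: CircuitBreaker,
  operation: () => Promise<T>,
): Promise<T> {
  if (!breaker.canAttempt()) throw new Error("circuit open: failing fast");
  try {
    const result = await operation();
    breaker.recordSuccess();
    return result;
  } catch (error) {
    breaker.recordFailure();
    throw error;
  }
}
```

Placing the breaker outside the retry loop means that once it opens, callers stop generating retry traffic at all. This is the behavior the three-layer stack relies on when the target is clearly unavailable.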