Day 35 of 60 · Load, chaos, durability
Chaos engineering / fault injection
You have retries. You have timeouts. Have you ever tested them? The dependency outage you've been preparing for is the one your retry logic was always going to mishandle.
Problem
Dependencies fail in production. Did your retries, timeouts, and fallbacks actually work?
How it works
Inject latency, errors, and outages into staging or even production. Toxiproxy at the network level; LitmusChaos / Chaos Mesh at the platform level.
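To make the idea concrete, here is a minimal in-process sketch of fault injection. It is not how Toxiproxy, LitmusChaos, or Chaos Mesh work internally (they inject faults at the network or platform layer); it only illustrates the principle of adding latency and errors to a dependency call. The names `FaultInjector`, `InjectedFault`, and `fetch_price` are hypothetical.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency error during an experiment."""


class FaultInjector:
    """Wraps a callable and injects latency and/or errors (hypothetical helper)."""

    def __init__(self, latency_s=0.0, error_rate=0.0, seed=None, sleep=time.sleep):
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be between 0 and 1")
        self.latency_s = latency_s
        self.error_rate = error_rate
        self._rng = random.Random(seed)  # seedable so experiments are repeatable
        self._sleep = sleep              # injectable so tests need not really wait

    def wrap(self, func):
        def wrapped(*args, **kwargs):
            # Latency fault: delay the call before it reaches the dependency.
            if self.latency_s > 0:
                self._sleep(self.latency_s)
            # Error fault: fail a fraction of calls instead of forwarding them.
            if self._rng.random() < self.error_rate:
                raise InjectedFault("injected dependency failure")
            return func(*args, **kwargs)
        return wrapped


def fetch_price(sku):
    """Stand-in for a real dependency call (hypothetical)."""
    return {"sku": sku, "price_cents": 1999}


# Simulate a full outage: every call fails.
outage = FaultInjector(error_rate=1.0).wrap(fetch_price)
# Control case: no faults injected, calls pass straight through.
healthy = FaultInjector(error_rate=0.0).wrap(fetch_price)
```

Pointing the caller at `outage` instead of `fetch_price` shows whether its error handling does anything sensible; a network-level proxy gives the same signal without changing application code.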
What it catches
Resilience gaps: missing timeouts, unbounded retries, cascading failures, leader-election bugs. 2026 industry surveys describe it as "standard practice"; skipping it at Large size and above is a documented gap.
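As an illustration of what a fix for "unbounded retries" might look like, the sketch below bounds both the attempt count and the total time spent, then falls back. It assumes exponential backoff and a hypothetical helper named `call_with_retries`; it does not enforce a per-call timeout, which the dependency client itself would need to provide.

```python
import time


def call_with_retries(func, *, max_attempts=3, deadline_s=2.0, backoff_s=0.1,
                      fallback=None, sleep=time.sleep, clock=time.monotonic):
    """Call func with bounded attempts and an overall deadline.

    Returns (result, attempts). When every attempt fails or the deadline
    would be exceeded, result is the fallback value.
    """
    start = clock()
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return func(), attempts
        except Exception:  # broad on purpose for the sketch; narrow this in real code
            remaining = deadline_s - (clock() - start)
            delay = backoff_s * (2 ** (attempts - 1))  # exponential backoff
            # Stop if out of attempts or if waiting would blow the deadline.
            if attempts >= max_attempts or delay >= remaining:
                break
            sleep(delay)
    return fallback, attempts


def always_down():
    """Simulated outage, standing in for an injected fault."""
    raise ConnectionError("dependency unavailable")


# Under a full outage the caller gives up after three attempts and serves the fallback.
result, attempts = call_with_retries(
    always_down, max_attempts=3, fallback="cached", sleep=lambda s: None
)
```

A chaos experiment is the check that this bound actually holds in the deployed service, not just in a unit test.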
Tools
Toxiproxy (OSS) · Chaos Mesh (OSS) · LitmusChaos (OSS) · Gremlin (SaaS)
Verdict by project size
| Project size | Verdict |
|---|---|
| Small | Skip |
| Medium | Recommended |
| Large | Must |
| Extra-large | Must |
Cost
| Project size | Setup | Maint / mo | Tool / mo | CI / run |
|---|---|---|---|---|
| Small <10k LOC | 1d | 1h | $0 | n/a |
| Medium 10–100k LOC | 5d | 5h | $0 | n/a |
| Large 100k–1M LOC | 20d | 30h | $1k | n/a |
| Extra-large >1M LOC | 80d | 150h | $10k | n/a |
Setup = engineer-days to first useful run
Maint = engineer-hours / month at steady state
Tool = out-of-pocket $ / month
CI = minutes added (or saved) per pipeline run
Lifecycle & ownership
When in lifecycle
Test · Release · Operate
Per release · Runs before promotion to production.
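One way to wire experiments into a release is a gate that blocks promotion when a run breaches agreed limits. The sketch below is an assumption about how such a gate could look, not a feature of any tool listed here; `ExperimentResult`, `release_gate`, the experiment names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentResult:
    """Metrics observed while a fault was being injected (hypothetical shape)."""
    name: str
    error_rate: float       # fraction of failed user-facing requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency during the experiment


def release_gate(results, max_error_rate=0.01, max_p99_ms=500.0):
    """Return (ok, failed_names); ok is True only if every experiment stayed within limits."""
    failed = [
        r.name
        for r in results
        if r.error_rate > max_error_rate or r.p99_latency_ms > max_p99_ms
    ]
    return len(failed) == 0, failed


# Example run with made-up numbers: the second experiment breaches the latency limit.
results = [
    ExperimentResult("db-latency-200ms", error_rate=0.002, p99_latency_ms=410.0),
    ExperimentResult("cache-outage", error_rate=0.004, p99_latency_ms=900.0),
]
ok, failed = release_gate(results)
```

Here `ok` is `False` and `failed` names the cache outage experiment, so the release would stay in staging until the fallback path is fixed.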
Who owns it
SRE / DevOps / Platform
CI/CD, observability, reliability
Collaborates with: Developer
Reference implementations
- Chaos Mesh examples: Kubernetes fault-injection experiments for resilience validation.
- LitmusChaos experiments: Reusable Kubernetes chaos experiment definitions.
- Toxiproxy: Network fault injection for latency, timeout, and dependency failure testing.