Day 35 of 60 · Load, chaos, durability

Chaos engineering / fault injection

You have retries. You have timeouts. Have you ever tested them? The dependency outage you've been preparing for is the one your retry logic was always going to mishandle.

Problem

Dependencies fail in production. Did your retry/timeout/fallback actually work?

How it works

Inject latency, errors, and outages into staging, or even production. Use Toxiproxy for faults at the network level, and LitmusChaos or Chaos Mesh for faults at the platform level.
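To make the idea concrete, here is a minimal in-process sketch of fault injection, assuming a hypothetical dependency call and a hypothetical cached fallback. The names `flaky_dependency` and `call_with_timeout_and_fallback` and the timeout value are illustrative and not taken from any of the tools above. Real experiments would inject the fault outside the process, but the shape of the check is the same.

```python
import concurrent.futures
import time


def flaky_dependency(delay_s: float, fail: bool) -> str:
    """Hypothetical stand-in for a real dependency, with injected faults."""
    time.sleep(delay_s)  # injected latency
    if fail:
        raise ConnectionError("injected outage")  # injected error
    return "live-data"


def call_with_timeout_and_fallback(delay_s: float, fail: bool,
                                   timeout_s: float = 0.2) -> str:
    """Call the dependency; return a fallback value on timeout or error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(flaky_dependency, delay_s, fail)
    try:
        return future.result(timeout=timeout_s)
    except (concurrent.futures.TimeoutError, ConnectionError):
        return "cached-fallback"
    finally:
        # Do not block the caller on the slow call still running in the worker.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_timeout_and_fallback(0.0, fail=False))  # healthy path
    print(call_with_timeout_and_fallback(0.6, fail=False))  # injected latency
    print(call_with_timeout_and_fallback(0.0, fail=True))   # injected error
```

The point of the experiment is the assertion, not the fault: each injected failure should have a stated expected behavior (here, falling back within the timeout) that the run either confirms or refutes.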

What it catches

Resilience gaps, missing timeouts, unbounded retries, cascading failures, leader-election bugs. Industry surveys from 2026 treat it as standard practice; at Large scale and above, not adopting it is a documented gap.
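As one illustration of catching unbounded retries, the sketch below injects a fixed number of failures and counts how many calls the retry logic actually makes. `FaultInjector` and `call_with_retries` are hypothetical names for this example, and the backoff values are arbitrary assumptions.

```python
import time


class FaultInjector:
    """Fails the first `failures` calls, then succeeds; counts every call."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected fault")
        return "ok"


def call_with_retries(op, max_attempts: int = 3,
                      base_delay_s: float = 0.01) -> str:
    """Retry with exponential backoff, giving up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # bounded: surface the failure instead of looping
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be >= 1")


if __name__ == "__main__":
    transient = FaultInjector(failures=2)
    print(call_with_retries(transient), transient.calls)

    outage = FaultInjector(failures=10**6)  # effectively a full outage
    try:
        call_with_retries(outage)
    except ConnectionError:
        print("gave up after", outage.calls, "attempts")
```

Counting calls on the injected side is what exposes a retry storm: a client that keeps calling during a full outage shows up as a call count far above the intended bound.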

Tools

Toxiproxy (OSS) · Chaos Mesh (OSS) · LitmusChaos (OSS) · Gremlin (SaaS)

Verdict by project size

Project size	Verdict
Small	Skip
Medium	Recommended
Large	Must
Extra-large	Must

Cost

Project size	Setup	Maint / mo	Tool / mo	CI / run
Small <10k LOC	1d	1h	$0	n/a
Medium 10–100k LOC	5d	5h	$0	n/a
Large 100k–1M LOC	20d	30h	$1k	n/a
Extra-large >1M LOC	80d	150h	$10k	n/a

Setup = engineer-days to first useful run · Maint = engineer-hours / month at steady state · Tool = out-of-pocket $ / month · CI = minutes added (or saved) per pipeline run

Lifecycle & ownership

When in lifecycle

Test · Release · Operate

Per release · Runs before promotion to production.

Who owns it

SRE / DevOps / Platform

CI/CD, observability, reliability

Collaborates with: Developer

Reference implementations

Chaos Mesh examples
Kubernetes fault-injection experiments for resilience validation.
LitmusChaos experiments
Reusable Kubernetes chaos experiment definitions.
Toxiproxy
Network fault injection for latency, timeout, and dependency failure testing.
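
For the Toxiproxy entry, the following sketch builds the two HTTP requests that, to my understanding of Toxiproxy's HTTP API, create a proxy and attach a latency toxic to it. The API address `localhost:8474` is Toxiproxy's documented default; the proxy name `redis`, the ports, and the helper function names are assumptions for illustration. Check the field names against the Toxiproxy README before relying on them.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # default Toxiproxy API address


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the Toxiproxy API."""
    return urllib.request.Request(
        f"{TOXIPROXY_API}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name: str, listen: str,
                         upstream: str) -> urllib.request.Request:
    """Request that routes traffic on `listen` through to `upstream`."""
    return _post("/proxies", {"name": name, "listen": listen,
                              "upstream": upstream, "enabled": True})


def add_latency_request(proxy: str, latency_ms: int,
                        jitter_ms: int = 0) -> urllib.request.Request:
    """Request that adds a latency toxic to responses from the upstream."""
    return _post(f"/proxies/{proxy}/toxics", {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    })


if __name__ == "__main__":
    # Hypothetical setup: the app connects to 26379 instead of Redis on 6379.
    for req in (create_proxy_request("redis", "127.0.0.1:26379", "127.0.0.1:6379"),
                add_latency_request("redis", latency_ms=1000, jitter_ms=100)):
        print(req.get_method(), req.full_url, req.data.decode())
```

With a Toxiproxy server running, each request could be sent with `urllib.request.urlopen(req)`; the application under test then points at the proxy's listen address rather than at the real dependency.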
