Day 32 of 60 · Load, chaos, durability
Game days & DR drills
Your runbook works on paper. Does the team work it at 3 AM, with the right person paged, with the right Slack channel still up? Drills tell you.
Problem
Your runbook works on paper. Does the team work it under stress at 3 AM?
How it works
Schedule a controlled outage. Page the on-call. Run the drill. Measure recovery time. Update the runbook.
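The "measure recovery time" step can be sketched as a small record of drill timestamps. This is a minimal illustration, not a prescribed format: the `DrillTimeline` class, its field names, the sample times, and the 30-minute target are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimeline:
    """Timestamps a drill facilitator might record (hypothetical fields)."""
    fault_injected: datetime    # controlled outage begins
    page_sent: datetime         # alert reaches the on-call
    page_acked: datetime        # on-call acknowledges the page
    service_restored: datetime  # recovery confirmed

    def time_to_detect(self) -> timedelta:
        # How long alerting took to notice the injected fault.
        return self.page_sent - self.fault_injected

    def time_to_ack(self) -> timedelta:
        # How long the on-call took to respond to the page.
        return self.page_acked - self.page_sent

    def time_to_recover(self) -> timedelta:
        # End-to-end recovery time, measured from fault injection.
        return self.service_restored - self.fault_injected


def meets_target(timeline: DrillTimeline, target: timedelta) -> bool:
    """True if the drill recovered within the (assumed) recovery target."""
    return timeline.time_to_recover() <= target


# Sample drill with made-up times.
drill = DrillTimeline(
    fault_injected=datetime(2024, 1, 15, 3, 0),
    page_sent=datetime(2024, 1, 15, 3, 4),
    page_acked=datetime(2024, 1, 15, 3, 9),
    service_restored=datetime(2024, 1, 15, 3, 47),
)

print("detect: ", drill.time_to_detect())
print("ack:    ", drill.time_to_ack())
print("recover:", drill.time_to_recover())
print("within 30 min target:", meets_target(drill, timedelta(minutes=30)))
```

Splitting recovery into detect, acknowledge, and restore phases shows which part of the process the runbook update should address: slow detection points at alerting, slow acknowledgement at paging.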
What it catches
Process gaps, missing runbooks, miscalibrated alerting, incident-comms friction. The "soft" defects no tool finds.
Tools
(scenario design, no tooling) · OSS
Verdict by project size
| Project size | Verdict |
|---|---|
| Small | Skip |
| Medium | Skip |
| Large | Recommended |
| Extra-large | Must |
Cost
| Project size | Setup | Maint / mo | Tool / mo | CI / run |
|---|---|---|---|---|
| Small <10k LOC | 0d | 0h | $0 | n/a |
| Medium 10–100k LOC | 1d | 2h | $0 | n/a |
| Large 100k–1M LOC | 5d | 10h | $0 | n/a |
| Extra-large >1M LOC | 20d | 40h | $0 | n/a |
Setup = engineer-days to first useful run · Maint = engineer-hours / month at steady state · Tool = out-of-pocket $ / month · CI = minutes added (or saved) per pipeline run
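To compare rows, setup and maintenance can be folded into one first-year figure. The sketch below takes the numbers from the table and assumes an 8-hour engineer-day, which the table does not state; the function and constant names are illustrative.

```python
# Assumption: one engineer-day equals 8 engineer-hours (not stated in the table).
HOURS_PER_ENGINEER_DAY = 8

# (setup in engineer-days, maintenance in engineer-hours per month), from the table.
COSTS = {
    "Small": (0, 0),
    "Medium": (1, 2),
    "Large": (5, 10),
    "Extra-large": (20, 40),
}


def first_year_hours(setup_days: int, maint_hours_per_month: int) -> int:
    """One-off setup plus twelve months of steady-state maintenance, in hours."""
    return setup_days * HOURS_PER_ENGINEER_DAY + maint_hours_per_month * 12


for size, (setup_days, maint_hours) in COSTS.items():
    print(f"{size}: {first_year_hours(setup_days, maint_hours)} engineer-hours in year one")
```

Under that assumption the first-year totals are 0, 32, 160, and 640 engineer-hours for Small through Extra-large, with maintenance making up most of the total in each non-zero row.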
Lifecycle & ownership
When in lifecycle
Operate · Observe
Periodic · Quarterly or on-demand campaigns.
Who owns it
SRE / DevOps / Platform
CI/CD, observability, reliability
Collaborates with: Tech Lead / EM
Reference implementations
- Google SRE: Disaster Role Playing
  Operational drill pattern for validating people, process, and runbooks.
- AWS Well-Architected GameDay
  Cloud resilience game-day lab for validating recovery practice.
- Incident response training
  Incident management system that can support drills, comms, and response workflows.