Day 32 of 60 · Load, chaos, durability

Game days & DR drills

Your runbook works on paper. Does the team work it at 3 AM, with the right person paged, with the right Slack channel still up? Drills tell you.

Problem

Your runbook works on paper. Does the team work it under stress at 3 AM?

How it works

Schedule a controlled outage. Page the on-call. Run the drill. Measure recovery time. Update the runbook.
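The "measure recovery time" step can be sketched as a small timeline calculation. This is a minimal illustration, not a prescribed tool: the `DrillTimeline` fields, the `summarize` helper, and the 30-minute recovery target are all hypothetical names and values chosen for the example, and a real drill would record whatever timestamps the team's runbook defines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimeline:
    """Timestamps noted by the drill facilitator (hypothetical field names)."""
    outage_injected: datetime    # controlled outage starts
    page_sent: datetime          # alerting pages the on-call
    page_acknowledged: datetime  # on-call acknowledges the page
    service_restored: datetime   # service verified healthy again


def summarize(t: DrillTimeline, recovery_target: timedelta) -> dict:
    """Derive detection, acknowledgement, and recovery durations for one drill."""
    time_to_detect = t.page_sent - t.outage_injected
    time_to_ack = t.page_acknowledged - t.page_sent
    time_to_recover = t.service_restored - t.outage_injected
    return {
        "time_to_detect": time_to_detect,
        "time_to_ack": time_to_ack,
        "time_to_recover": time_to_recover,
        # The target is an assumption; use whatever the team has committed to.
        "met_target": time_to_recover <= recovery_target,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 15, 3, 0)  # illustrative 3 AM drill
    drill = DrillTimeline(
        outage_injected=start,
        page_sent=start + timedelta(minutes=4),
        page_acknowledged=start + timedelta(minutes=9),
        service_restored=start + timedelta(minutes=47),
    )
    for name, value in summarize(drill, timedelta(minutes=30)).items():
        print(f"{name}: {value}")
```

Splitting the total into detect, acknowledge, and recover phases is one way to see whether a slow drill points at alerting, paging, or the runbook itself, which is the input the final "update the runbook" step needs.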

What it catches

Process gaps, missing runbooks, miscalibrated alerting, incident-comms friction. The "soft" defects no tool finds.

Tools

(scenario design, no tooling) · OSS

Verdict by project size

Small: Skip
Medium: Skip
Large: Recommended
Extra-large: Must

Cost

Project size               Setup   Maint / mo   Tool / mo   CI / run
Small (<10k LOC)           0h      0h           $0          n/a
Medium (10–100k LOC)       1d      2h           $0          n/a
Large (100k–1M LOC)        5d      10h          $0          n/a
Extra-large (>1M LOC)      20d     40h          $0          n/a
Setup = engineer-days to first useful run · Maint = engineer-hours / month at steady state · Tool = out-of-pocket $ / month · CI = minutes added (or saved) per pipeline run
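To compare the rows on a single scale, the setup and maintenance figures can be combined into a rough first-year effort. The sketch below assumes an 8-hour engineer-day and counts twelve months of maintenance from day one; neither assumption comes from the table, and `first_year_hours` is a hypothetical helper name.

```python
HOURS_PER_ENGINEER_DAY = 8  # assumption: the table does not define a day's length

# (setup in engineer-days, maintenance in engineer-hours per month), from the cost table
COSTS = {
    "Small": (0, 0),
    "Medium": (1, 2),
    "Large": (5, 10),
    "Extra-large": (20, 40),
}


def first_year_hours(setup_days: float, maint_hours_per_month: float) -> float:
    """Setup effort plus twelve months of steady-state maintenance, in engineer-hours."""
    return setup_days * HOURS_PER_ENGINEER_DAY + maint_hours_per_month * 12


if __name__ == "__main__":
    for size, (setup_days, maint) in COSTS.items():
        hours = first_year_hours(setup_days, maint)
        print(f"{size}: {hours:.0f} engineer-hours in year one")
```

Under these assumptions the first-year totals come to 0, 32, 160, and 640 engineer-hours for Small through Extra-large. The out-of-pocket tool cost stays at $0 in every row, so engineer time is the whole bill.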

Lifecycle & ownership

When in lifecycle
Operate · Observe
Periodic · Quarterly or on-demand campaigns.
Who owns it
SRE / DevOps / Platform
CI/CD, observability, reliability
Collaborates with: Tech Lead / EM

Quick check

Game days / DR drills primarily test…

thinkbridge · The Validation Atlas