February 28, 2026 · 9 min read · stresstest.qa

How to Run a Chaos Engineering GameDay: Template and Runbook

Step-by-step GameDay planning guide with runbook template, 10 scenario ideas for SaaS startups, and common mistakes to avoid.


A Chaos Engineering GameDay is a structured, time-boxed event where engineering teams deliberately inject failures into their systems and observe how they respond. Unlike automated chaos experiments that run continuously, a GameDay is a collaborative exercise that involves engineers from multiple teams, produces shared learning, and builds organizational confidence in the system’s resilience.

GameDays are one of the most effective ways to start a chaos engineering program because they are visible, collaborative, and produce immediate actionable results. This guide gives you a complete GameDay planning template, a five-phase runbook with timing, ten SaaS-specific scenarios to choose from, and the mistakes that cause GameDays to fail.

What Is a Chaos Engineering GameDay?

The term “GameDay” was popularized by Amazon, where engineering teams conduct regular resilience exercises to test their systems and incident response capabilities. The name reflects the collaborative, event-like nature of the exercise - it is a planned event, not a surprise.

A GameDay differs from a fire drill in a crucial way: a fire drill tests whether people know the procedures, while a GameDay tests whether the system and the people can respond effectively to real failures. Participants know a failure will be injected, but they do not know exactly what will happen or how severe the impact will be.

GameDays serve multiple purposes:

  • Technical: Find resilience gaps before users encounter them
  • Organizational: Build cross-team familiarity with failure scenarios and response procedures
  • Process: Validate runbooks, escalation paths, and communication protocols
  • Cultural: Normalize talking about failures as learning opportunities rather than embarrassments

Pre-GameDay Checklist (10 Items)

Run through this checklist at least one week before the GameDay:

  1. Define objectives. What specific resilience properties are you testing? Write two to three sentences describing what success looks like. “We want to verify that the payment service degrades gracefully when the inventory service is unavailable, maintaining checkout for users with items already in their cart.”

  2. Identify participants. A GameDay requires: a Chaos Lead (runs the experiment), an Observer (watches metrics and documents findings), an Application Owner (answers questions about expected behavior), an Incident Commander (manages any real incidents that emerge), and optional Observers from other teams.

  3. Select scenarios. Choose two to three scenarios for a three-hour GameDay. More than three spreads attention too thin and does not fit the per-phase timings in the runbook below. See the scenario list below.

  4. Define steady state for each scenario. For each scenario, write down the specific metrics and thresholds that indicate the system is healthy. These must be visible on existing dashboards.

  5. Prepare rollback procedures. For each chaos injection, write down the exact command or procedure to stop the experiment and restore normal state. Test these in a staging environment.

  6. Set up communication channels. Create a dedicated Slack channel (or equivalent) for the GameDay. All observations, decisions, and findings go into this channel to create a searchable record.

  7. Brief stakeholders. Notify relevant stakeholders - your on-call team, your customer success team, your manager - that a GameDay is happening. Include the time window, the systems involved, and the expected customer impact (should be zero, but be transparent).

  8. Verify monitoring. Open the dashboards you will watch during the GameDay. Verify that all metrics are current and the dashboards load correctly. The last thing you want is to discover a broken dashboard mid-experiment.

  9. Prepare the tooling. If you are using LitmusChaos, Chaos Mesh, or AWS FIS, verify that the tooling is installed and you can execute a test injection in a non-production environment.

  10. Schedule a retrospective. Block 45 minutes immediately after the GameDay for a structured retrospective. If you do not block this time in advance, it does not happen.
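The per-scenario items on this checklist (hypothesis, steady-state metrics, rollback, stop conditions) can be captured in a small data structure and validated before the day. A minimal sketch, assuming nothing about your tooling; the field names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioPlan:
    name: str
    hypothesis: str = ""
    steady_state_metrics: list = field(default_factory=list)  # e.g. [("p99_ms", "<", 500)]
    rollback_command: str = ""
    stop_conditions: list = field(default_factory=list)

    def missing_items(self):
        """Return the checklist items this scenario still lacks."""
        missing = []
        if not self.hypothesis:
            missing.append("hypothesis")
        if not self.steady_state_metrics:
            missing.append("steady-state metrics")
        if not self.rollback_command:
            missing.append("rollback procedure")
        if not self.stop_conditions:
            missing.append("stop conditions")
        return missing
```

A scenario with any item still missing a week out is not ready to run.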

Five-Phase GameDay Runbook

This runbook fits a three-hour GameDay window. Adjust timing for more or fewer scenarios.
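The arithmetic behind "fits a three-hour window" is worth making explicit. Using the per-phase timings below (15-minute kickoff, 10-minute baseline and 10-minute recovery per scenario, a 15-30-minute injection, and a 45-minute retrospective), a quick sketch:

```python
def gameday_minutes(num_scenarios, injection_minutes=15):
    """Total GameDay length from this runbook's per-phase timings:
    fixed kickoff and retrospective, plus baseline + injection +
    recovery for each scenario."""
    KICKOFF, BASELINE, RECOVERY, RETRO = 15, 10, 10, 45
    per_scenario = BASELINE + injection_minutes + RECOVERY
    return KICKOFF + num_scenarios * per_scenario + RETRO

# Three scenarios with 15-minute injections: 165 minutes, inside 3 hours.
# Three scenarios with 30-minute injections: 210 minutes - cut a scenario.
```

This is why longer injections mean fewer scenarios, not a longer day.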

Phase 1: Kickoff (15 minutes)

  • T+0: Welcome participants, confirm attendance
  • T+5: Review objectives for the session
  • T+10: Confirm monitoring dashboards are visible to all participants
  • T+13: Chaos Lead confirms rollback procedures are ready
  • T+15: Move to first scenario

The Chaos Lead reads the hypothesis for the first scenario aloud. Confirm that all participants understand what failure will be injected and what the expected outcome is. This is not optional - if anyone is confused about what is being tested, the learning from the experiment will be muddled.

Phase 2: Baseline Observation (10 minutes per scenario)

Before injecting any failure, observe the system in its normal state. Document:

  • Current values for all steady-state metrics
  • Any anomalies already present in the system
  • Timestamp of baseline observation

A baseline is essential. Without it, you cannot distinguish between a problem caused by your experiment and a problem that already existed.
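Capturing the baseline can be as simple as snapshotting the steady-state metrics with a timestamp, then diffing against that snapshot during injection. A sketch under the assumption that `read_metric` stands in for whatever queries your monitoring system (Prometheus, Datadog, etc.); the function names are illustrative:

```python
import time

def capture_baseline(read_metric, metric_names):
    """Snapshot steady-state metric values, with a timestamp,
    before any failure is injected."""
    return {
        "timestamp": time.time(),
        "values": {name: read_metric(name) for name in metric_names},
    }

def delta_from_baseline(baseline, read_metric):
    """Compare current metric values against the recorded baseline."""
    return {
        name: read_metric(name) - before
        for name, before in baseline["values"].items()
    }
```

Paste the baseline snapshot into the GameDay channel so the "before" numbers are part of the searchable record.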

Phase 3: Chaos Injection (15-30 minutes per scenario)

  • T+0: Chaos Lead announces: “Injecting [specific failure] now”
  • T+0: Log the injection command and timestamp in the GameDay channel
  • T+0:30: All observers begin watching their assigned metrics
  • T+1: Chaos Lead asks: “What are we seeing?” - go around the room
  • T+5: Chaos Lead asks again. Document observations.
  • T+15: Chaos Lead asks: “Are we at stop condition?”
  • T+20: If no stop condition: continue observing
  • T+30: Stop experiment (or earlier if steady state violated)

Stop conditions are pre-defined states that trigger immediate experiment termination:

  • Steady-state metric crosses critical threshold (not just the experiment threshold)
  • A real incident is triggered (PagerDuty alert fires for real users)
  • A participant identifies an unexpected system behavior that could cascade badly
  • Any participant calls out a safety concern

Stop conditions must be respected immediately. Write them down before the experiment and treat them as non-negotiable.
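Writing stop conditions down as data makes "non-negotiable" easy to honor: a single check the Chaos Lead runs each time they ask "are we at stop condition?". A minimal sketch; the threshold format and flag names are illustrative, not a specific tool's API:

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

def should_stop(metrics, critical_thresholds, safety_flags):
    """Return a reason to abort the experiment, or None to keep observing.

    critical_thresholds maps metric -> (comparator, limit), e.g.
    {"http_success_rate": ("<", 95.0)}. safety_flags are human-set
    booleans: a real alert fired, a participant raised a concern.
    """
    for name, (op, limit) in critical_thresholds.items():
        if name in metrics and OPS[op](metrics[name], limit):
            return f"{name} crossed critical threshold: {metrics[name]} {op} {limit}"
    for reason, raised in safety_flags.items():
        if raised:
            return reason
    return None
```

Any non-None result means run the rollback procedure immediately, no discussion.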

Phase 4: Recovery Observation (10 minutes per scenario)

After stopping the chaos injection, observe how the system recovers:

  • How long until steady-state metrics return to baseline?
  • Does recovery happen automatically, or does it require manual intervention?
  • Are there any lingering effects (cache inconsistency, queued retries, connection pools not fully recovered)?

Document the recovery timeline with timestamps. Recovery behavior is often where the most interesting findings emerge.

Phase 5: Retrospective (45 minutes)

Run the retrospective immediately after the final scenario. Use this structure:

What did we observe? (15 minutes)

  • Each participant shares one key observation. Chaos Lead documents them.
  • Focus on facts: what metrics changed, by how much, for how long.

What surprised us? (10 minutes)

  • What happened that we did not expect?
  • What hypothesis was wrong, and why?

What do we fix? (15 minutes)

  • For each weakness found, write a specific action item with an owner and a due date.
  • Be concrete: “Add circuit breaker to payment-service’s call to inventory-service” not “improve resilience.”

What do we test next? (5 minutes)

  • Based on what you learned today, what is the next most important scenario to test?

10 SaaS Scenario Ideas

Choose scenarios that match your architecture. Each scenario below includes the failure to inject and the steady-state metrics to watch.

Scenario 1: Single Instance Termination

Inject: Terminate one pod or instance of a stateless service (auth, API, notification). Watch: HTTP success rate, p99 latency, number of healthy instances. Tests: Load balancer health check speed, connection draining, horizontal scaling behavior.

Scenario 2: Database Connection Pool Exhaustion

Inject: Simulate high connection count by opening many idle connections to the database. Watch: Database connection count, application error rate, query latency. Tests: Connection pool configuration, graceful degradation when pool is exhausted.

Scenario 3: Downstream Service Latency

Inject: Add 2-5 seconds of latency to calls from service A to service B. Watch: Service A response time, timeout behavior, circuit breaker trip. Tests: Timeout configuration, circuit breaker implementation, async fallback.
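The circuit breaker this scenario exercises is simple enough to sketch. The pattern: after a run of consecutive failures, stop calling the slow dependency and fail fast to a fallback until a cool-off period elapses. A minimal in-process version (all parameters illustrative; production services usually use a library such as resilience4j or their service mesh):

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures the circuit opens and
    calls go straight to the fallback for reset_seconds, instead of
    stacking up slow calls to an unhealthy downstream service."""

    def __init__(self, max_failures=3, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                return fallback()          # open: fail fast
            self.opened_at = None          # half-open: let one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                  # success closes the circuit
        return result
```

During the GameDay, watch for exactly this transition: does service A keep waiting out full timeouts, or does it start failing fast once the dependency degrades?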

Scenario 4: Third-Party API Outage

Inject: Block or redirect calls to a third-party API (email provider, payment processor, analytics). Watch: Dependent feature availability, error handling, queue backup. Tests: Graceful degradation, retry logic, feature flags for disabling dependent features.

Scenario 5: Memory Pressure

Inject: Apply memory stress to one or more application pods (70-85% memory utilization). Watch: Pod memory usage, GC pause time, application latency, OOM events. Tests: Memory limits, eviction behavior, effect of GC pressure on response times.

Scenario 6: DNS Resolution Failure

Inject: Introduce DNS errors for a specific service name. Watch: Service discovery, connection errors, DNS cache behavior. Tests: DNS caching, retry logic for connection failures, service mesh behavior.

Scenario 7: Message Queue Consumer Failure

Inject: Stop all consumers for one message queue or topic. Watch: Queue depth, producer behavior, message TTL/DLQ behavior. Tests: Queue depth alerting, DLQ configuration, producer back-pressure handling.
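Queue-depth alerting is one of the things this scenario validates, and the usual pitfall is alerting on a single spiky sample. A sketch of a sustained-threshold check (function name and parameters are illustrative):

```python
def queue_depth_alert(depth_samples, threshold, sustained_samples=3):
    """Fire only when queue depth stays above threshold for N
    consecutive samples, so a brief consumer restart or deploy does
    not page anyone, but a genuinely stopped consumer does."""
    streak = 0
    for depth in depth_samples:
        streak = streak + 1 if depth > threshold else 0
        if streak >= sustained_samples:
            return True
    return False
```

If your alert fires the instant consumers stop, check whether it also fires on every deploy; if it never fires during this scenario, that is the finding.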

Scenario 8: Cache Invalidation (Redis/Memcached Restart)

Inject: Restart the caching layer or flush all cache keys. Watch: Cache hit rate, database query rate, application response time. Tests: Cold-start behavior, thundering herd protection, database capacity under full cache miss.
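"Thundering herd protection" here means that after a flush, one caller recomputes each hot key while the rest wait, rather than every request hitting the database at once. A minimal in-process sketch of the idea (real deployments often use a distributed lock or request coalescing in front of Redis; the class and names are illustrative):

```python
import threading

class HerdSafeCache:
    """Per-key lock: on a cache miss only one caller runs the
    expensive compute; concurrent callers for the same key block
    until the value exists, then read it."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._registry = threading.Lock()

    def get(self, key, compute):
        if key in self._values:                  # fast path: warm cache
            return self._values[key]
        with self._registry:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self._values:          # re-check after waiting
                self._values[key] = compute()
            return self._values[key]
```

During the scenario, the database query rate tells you whether your stack has this property: a brief bump per hot key is fine, a sustained spike proportional to request volume is the herd.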

Scenario 9: Disk I/O Saturation

Inject: Apply disk I/O stress to nodes running stateful services. Watch: Disk I/O metrics, database write latency, application error rate. Tests: Storage performance under pressure, WAL behavior, application timeouts for slow writes.

Scenario 10: Network Partition Between Zones

Inject: Block traffic between pods in availability zone A and availability zone B. Watch: Inter-zone request failure rate, service discovery, data consistency. Tests: Multi-AZ resilience, cross-zone traffic routing, consistency under partition.

Common Mistakes That Cause GameDays to Fail

Mistake 1: No Hypothesis Written Down

“Let’s see what happens when we kill a pod” is not a hypothesis. Without a written hypothesis, observers do not know what to watch, you cannot determine whether the experiment was a success or failure, and the learning degrades to anecdote.

Write the hypothesis on a shared document before injecting anything.

Mistake 2: Skipping the Baseline

Teams eager to run experiments skip the 10-minute baseline observation. Then when something looks wrong after injection, they cannot tell whether it was caused by the experiment or whether it was already present.

Always document baseline metric values before injecting.

Mistake 3: No Rollback Plan

“We’ll figure it out” is not a rollback plan. If an experiment unexpectedly cascades into a real incident, you need to restore normal state immediately. Not having a tested rollback procedure means you may spend 20 minutes figuring out how to stop an experiment while users are affected.

Test your rollback procedures before the GameDay.

Mistake 4: Too Many Scenarios

A three-hour GameDay with six scenarios produces shallow learning. You rush through each scenario, miss nuances in the data, and leave without clear findings. Two to three scenarios, each with a proper baseline, careful monitoring, and a thorough discussion in the retrospective, produce far more value.

Mistake 5: Only Including the SRE Team

GameDays are most valuable when developers who built the service observe how it fails. Application developers often know immediately why something is happening (“oh, that’s because we have a 10-second timeout hardcoded”) and that context dramatically accelerates the learning.

Invite at least one developer from each team whose services will be affected.

Mistake 6: No Action Items With Owners

A GameDay retrospective that ends with “we need to improve our resilience” has accomplished nothing actionable. Every weakness found must become a ticket with a specific owner and due date before the retrospective ends.

If the action items do not exist in your issue tracker by end of day, they will not happen.

Ready to run your first GameDay? Our team can design and facilitate a full-day resilience exercise for your engineering team, including scenario design, tooling setup, and a comprehensive findings report delivered within 48 hours.

Know Your Blast Radius

Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.

Talk to an Expert