From Chaos Monkey to Production Chaos: How Top Engineering Teams Build Resilience
The evolution of chaos engineering from Netflix's Chaos Monkey to modern production resilience - with a maturity model for startups.
In 2010, a Netflix engineer wrote a script that randomly terminated EC2 instances in production. It ran during business hours. The engineering team had debated whether this was reckless - and concluded that the recklessness was in not running it. If their services could not survive a single instance termination, they needed to know before a user-facing incident revealed it.
That script was Chaos Monkey. Fifteen years later, it has spawned an entire engineering discipline, a CNCF ecosystem of tools, dedicated SRE roles at every major technology company, and a fundamentally different way of thinking about production systems.
This is the story of how chaos engineering evolved, where it stands today, and how startups can implement the same discipline that protects Netflix, Amazon, Google, and Slack at massive scale.
Netflix 2010: The Birth of Chaos Monkey
The context matters. Netflix in 2010 was in the middle of a multi-year migration from its own data centers to AWS. The impetus for the move was a major database corruption incident in 2008 that caused a three-day outage - a catastrophic event for a company that had built its brand on reliability.
The AWS migration brought new capabilities but also new failure modes. EC2 instances were ephemeral in a way that physical servers were not. Any instance could be terminated by AWS with little warning. The network was more complex and less reliable than a data center network. Distributed systems failures were harder to predict and debug than monolithic ones.
Yury Izrailevsky and Ariel Tseitlin, two engineers on Netflix’s Cloud Platform team, articulated the problem clearly: “In a cloud environment where any instance could fail at any time, the best way to ensure that your service can handle these failures is to deliberately inject them into production.” Chaos Monkey was their answer.
The logic was simple and powerful:
- Instances will fail in production.
- We cannot prevent all failures.
- We can ensure our services handle failures gracefully.
- The only way to know if a service handles failures gracefully is to inject failures.
- Therefore, inject failures deliberately and fix what breaks.
Chaos Monkey ran during business hours specifically because that was when the engineering team was available to respond to the problems it revealed. Running it at 3am when no one was watching would have been irresponsible.
Evolution: The Simian Army
By 2011, Netflix had extended the concept beyond single-instance termination. The Simian Army was a collection of chaos tools, each with a different failure mode:
Chaos Gorilla simulated the failure of an entire AWS Availability Zone. Where Chaos Monkey tested instance-level resilience, Chaos Gorilla tested AZ-level resilience. Services had to maintain availability even when an entire AZ - and all instances in it - became unavailable.
Latency Monkey introduced artificial delays in the RESTful client-server communication layer. It tested how services behaved when dependencies became slow rather than completely unavailable. Slow dependencies are often more dangerous than dead ones because they tie up threads and connection pools and exhaust timeout budgets in ways that can cascade.
Conformity Monkey checked whether instances conformed to Netflix’s engineering best practices. Instances that violated standards were automatically terminated. This enforced engineering discipline at scale.
Doctor Monkey monitored instances for signs of degradation (high CPU, memory pressure) and terminated them before they could fail in an uncontrolled way.
Janitor Monkey cleaned up unused resources - instances, volumes, and security groups that were no longer needed but consuming capacity and creating potential failure points.
The Simian Army reflected a key insight: resilience is not a single property, it is a portfolio of properties. A service that survives instance termination may not survive AZ failure. A service that survives AZ failure may still fail if a dependency becomes slow. Each Simian tested a different resilience dimension.
Adoption at Amazon, Google, and Slack
Netflix’s public description of the Simian Army inspired other companies to develop their own chaos engineering practices.
Amazon had been doing something similar internally for years - their GameDay practice involved engineering teams deliberately breaking systems and practicing incident response. Amazon built AWS Fault Injection Simulator (launched 2021) as a managed service that brings these practices to all AWS customers.
Google’s SRE practice, documented in the “Site Reliability Engineering” book published in 2016, formalized related concepts through error budgets and disaster recovery testing. Google’s DiRT (Disaster Recovery Testing) program runs large-scale outage simulations annually, testing whether Google’s systems and teams can recover from simulated catastrophes.
Slack published extensively about their chaos engineering practice starting in 2018. Their approach is notable for its emphasis on minimal blast radius - running experiments against a small fraction of production traffic initially, expanding scope as confidence grows. Slack also pioneered the integration of chaos engineering with feature flags, enabling instant experiment termination by toggling a flag.
LinkedIn developed their chaos engineering practice around the concept of “failure injection testing” - systematic testing of how their distributed systems respond to common failure modes including hardware failures, software failures, and human errors.
The pattern across all of these companies is consistent: chaos engineering starts as an experiment-driven practice (what happens when we do this?) and matures into a systematic discipline (we continuously verify our resilience properties against a defined set of failure modes).
The Startup Version
Most startups look at Netflix’s Simian Army and conclude that chaos engineering is a capability for companies with hundreds of SREs and unlimited engineering budgets. This is a mistake.
Here is a fundamental chaos engineering program that any startup can implement:
Week 1: Define steady state for your three most important services. Write down three to five metrics per service that indicate it is healthy. Connect these to your existing monitoring.
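The Week 1 steady-state definition can be captured as data rather than prose. A minimal sketch, where the service, metric names, and healthy ranges are all illustrative assumptions, not a real monitoring API:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateMetric:
    name: str
    lower: float  # lowest value still considered healthy
    upper: float  # highest value still considered healthy

# Hypothetical steady state for a checkout service: three metrics,
# each with an explicit healthy range.
CHECKOUT_STEADY_STATE = [
    SteadyStateMetric("success_rate_pct", lower=99.5, upper=100.0),
    SteadyStateMetric("p99_latency_ms",   lower=0.0,  upper=350.0),
    SteadyStateMetric("orders_per_min",   lower=40.0, upper=10_000.0),
]

def steady_state_holds(observed: dict[str, float]) -> bool:
    """Return True only if every metric is inside its healthy range."""
    return all(
        m.lower <= observed.get(m.name, float("-inf")) <= m.upper
        for m in CHECKOUT_STEADY_STATE
    )
```

Wiring `observed` to your existing monitoring is the whole exercise; the point is that "healthy" becomes a checkable predicate instead of a feeling.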
Week 2: Run your first chaos experiment manually. Terminate one instance of one service during business hours. Document what happens. Did steady state hold? What surprised you?
Week 3: Fix what broke. If the instance termination revealed a resilience gap, fix it. This is the most important step - chaos engineering without remediation is theater.
Month 2: Introduce a second experiment type. Test your database failover, or inject latency into a key dependency. Add the findings to your growing library of understood failure modes.
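Latency injection for the Month 2 experiment needs no special tooling either. A minimal sketch, assuming you can wrap the dependency call in your own code; `fetch_recommendations` and the 200 ms delay are illustrative placeholders:

```python
import functools
import time

def inject_latency(delay_ms: float):
    """Decorator that sleeps before the real call, simulating a slow dependency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(delay_ms / 1000.0)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(delay_ms=200)
def fetch_recommendations(user_id: str) -> list[str]:
    # Stand-in for a real downstream call.
    return [f"item-for-{user_id}"]
```

Watching your steady-state metrics while this wrapper is active tells you whether slow dependencies tie up your threads and connection pools the way the Latency Monkey section warns about.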
Month 3: Automate one experiment to run weekly. Even a simple scheduled script that terminates a random instance and sends findings to Slack is continuous chaos engineering.
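The "simple scheduled script" from Month 3 might look like this sketch. The termination and notification steps are injected as plain functions so they can be swapped for your cloud SDK's terminate call and a real Slack incoming-webhook URL; both are assumptions here, not a specific API:

```python
import json
import random
import urllib.request

def run_weekly_experiment(instance_ids, terminate, notify, rng=None):
    """Terminate one randomly chosen instance and report what happened."""
    rng = rng or random.Random()
    victim = rng.choice(instance_ids)
    terminate(victim)  # e.g. your cloud SDK's instance-termination call
    notify(f"Chaos experiment: terminated {victim}; check steady-state dashboards.")
    return victim

def slack_notify(message: str, webhook_url: str) -> None:
    # Assumed Slack incoming-webhook integration; the URL is a placeholder
    # you create in your own workspace.
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```

Run it from cron or your CI scheduler once a week, during business hours, for the same reason Chaos Monkey did.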
The sophistication of Netflix’s tooling is not what makes chaos engineering valuable. What makes it valuable is the discipline of deliberately confronting failure, learning from what you observe, and systematically improving resilience. That discipline requires no special tooling.
Chaos Engineering Maturity Model
A maturity model helps teams understand where they are and what the next step looks like. This model describes five levels.
Level 0: Reactive
Characteristics:
- Failures are discovered by users
- Post-mortems happen after incidents, not before
- No systematic testing of failure scenarios
- Resilience is assumed, not verified
How to identify Level 0: Your most recent three incidents were all described as “unexpected.” No one had tested the failure mode that caused them.
Level 1: Manual Chaos
Characteristics:
- Teams run ad-hoc chaos experiments, usually after an incident to verify fixes
- No defined steady state before experiments
- Results are anecdotal (“it seemed to work”)
- Experiments run in staging, not production
How to identify Level 1: Your team has terminated instances or tested failover at least once in the past six months, but has no documentation of what was tested, what was observed, or what was fixed as a result.
Level 2: Structured Experiments
Characteristics:
- Steady state is defined before experiments
- Experiments are documented with hypothesis, methodology, and findings
- Experiments run in production with defined stop conditions
- Findings generate concrete engineering tasks
- Experiments cover the most common failure modes (instance failure, database failover, dependency latency)
How to identify Level 2: Your team has a Google Doc or Confluence page documenting at least five chaos experiments with findings and action items.
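The documentation that defines Level 2 can be made concrete with a simple experiment record. A sketch with hypothetical field names, mirroring the checklist above (hypothesis, methodology, stop conditions, findings, action items):

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    hypothesis: str            # e.g. "Checkout survives loss of one API instance"
    methodology: str           # what you will do, where, and for how long
    stop_conditions: list[str] # abort triggers checked during the run
    findings: str = ""         # filled in after the run
    action_items: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A Level 2 record is only done once the findings are written up."""
        return bool(self.findings)
```

Whether this lives as code, a Confluence template, or a Google Doc heading structure matters less than filling in every field for every experiment.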
Level 3: Automated and Scheduled
Characteristics:
- Experiments run automatically on a schedule
- Results are compared to a baseline to detect regressions
- New failure modes are tested within weeks of deployment
- CI/CD integration exists for at least some experiments
- Engineering culture treats chaos findings as normal engineering work
How to identify Level 3: You have a Slack channel where automated chaos experiment results post weekly. Your SRE or platform team reviews these results as a normal part of their work.
Level 4: Continuous Production Chaos
Characteristics:
- Chaos experiments run continuously in production
- Results feed into SLO tracking and error budget accounting
- Every new service is required to pass a chaos experiment suite before production deployment
- Chaos engineering informs capacity planning and infrastructure decisions
- The organization has a shared resilience roadmap that is updated quarterly
How to identify Level 4: Chaos engineering is referenced in your engineering standards documentation. New engineers are onboarded to your chaos program as part of their first month.
Most well-funded startups (Series A and above) should target Level 2 within three months of starting a chaos engineering program, and Level 3 within nine months. Level 4 is the long-term aspiration, not the starting point.
The Business Case for Production Chaos
Engineering teams often need to make the business case for chaos engineering to leadership that is focused on shipping features, not deliberately breaking things.
The business case has three components:
Cost of outages. Calculate your revenue impact per hour of downtime, including direct revenue loss, customer churn, and reputational damage. For a SaaS company doing $5M ARR, a one-hour outage costs roughly $570 in direct revenue, but the churn and reputation effects are typically 5-10x larger. A single 2-hour incident that chaos engineering could have prevented represents more value than months of chaos program investment.
Cost of incident response. Senior engineers responding to production incidents typically cost $200-500 per engineer-hour when you account for total compensation. A complex incident involving five engineers for four hours costs $4,000-10,000 in engineering time alone, plus the opportunity cost of delayed features. Chaos engineering reduces both incident frequency and incident complexity (because failure modes are understood in advance).
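The two back-of-envelope figures above are easy to reproduce. A sketch using the article's own illustrative numbers ($5M ARR, $200-500 per engineer-hour), not benchmarks:

```python
ARR = 5_000_000
HOURS_PER_YEAR = 365 * 24  # 8,760

# Direct revenue lost per hour of total downtime: roughly $570.
direct_cost_per_hour = ARR / HOURS_PER_YEAR

def incident_cost(engineers: int, hours: float, rate_low=200, rate_high=500):
    """Engineering-time cost band for a single incident."""
    engineer_hours = engineers * hours
    return engineer_hours * rate_low, engineer_hours * rate_high

# Five engineers for four hours: $4,000-10,000 in engineering time.
low, high = incident_cost(engineers=5, hours=4)
```

Plugging in your own ARR and incident history turns these from illustrations into the first slide of your business case.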
Competitive differentiation. Reliability is a product feature. Users and enterprise customers compare reliability when choosing between competing products. A track record of high availability - maintained through systematic resilience practices - is a durable competitive advantage.
The meta-point: chaos engineering does not create risk, it reveals it. The risk already exists in your system. Chaos engineering surfaces it in a controlled environment where you can fix it, rather than allowing it to surface during a 3am incident when you cannot.
Ready to move from reactive incident response to proactive resilience? Our team can help you build a chaos engineering program at any maturity level - from your first structured experiment to a continuous production chaos practice.
Know Your Blast Radius
Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
Talk to an Expert