February 16, 2026 · 10 min read · stresstest.qa

What Is Chaos Engineering? A Practical Guide for Engineering Teams

A comprehensive guide to chaos engineering - principles, tools, and how to run your first chaos experiment. Learn why startups need resilience testing.

Modern distributed systems fail in ways that are impossible to predict by reading code alone. Services crash, networks partition, databases slow down, and dependencies return unexpected errors - often all at once, during your highest-traffic periods. Chaos engineering is the discipline of deliberately injecting failures into your systems to discover weaknesses before they cause outages that affect real users.

This guide covers everything an engineering team needs to understand and practice chaos engineering: where it came from, the principles that make it effective, how to run your first experiment, and the tools available today.

The Netflix Origin Story

In 2010, Netflix completed its migration from on-premises data centers to AWS. The engineering team recognized that any of the hundreds of EC2 instances they depended on could fail at any moment - and AWS did not guarantee 100% availability. The traditional response was to hope failures wouldn’t happen, or to write defensive code and trust that it worked.

The Netflix team took a different approach. They built a tool called Chaos Monkey that would randomly terminate EC2 instances in production during business hours. The logic was simple but powerful: if your services can’t tolerate a single instance failure, you need to know now, not during a 3am incident.

Chaos Monkey forced engineers to build resilient services. If your service died when Chaos Monkey killed one of its instances, you had to fix it - because Chaos Monkey would keep running.

By 2012, Netflix had expanded this into the Simian Army: a collection of chaos tools including Chaos Gorilla (which terminated entire availability zones), Latency Monkey (which introduced artificial delays), and Conformity Monkey (which checked for non-compliant instances). This body of work eventually became the foundation for what the industry now calls chaos engineering.

In 2017, the Netflix engineering team published the Principles of Chaos Engineering - a formal specification that gave the discipline a rigorous foundation and enabled the broader industry to adopt it systematically.

The Five Principles of Chaos Engineering

The principles are not rules to follow mechanically. They are constraints that separate effective chaos engineering from random destruction.

1. Build a Hypothesis Around Steady State Behavior

Before running any experiment, you must define what “normal” looks like. Steady state is a measurable output of the system that indicates it is working correctly: request success rate above 99.5%, p99 latency below 200ms, order processing throughput above 1000/minute.

Without a steady state definition, you cannot know whether your experiment revealed a problem. A hypothesis takes the form: “If we inject X failure, the system will maintain steady state Y.”
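
As a concrete sketch, a hypothesis of this shape can be encoded as data and checked mechanically. The metric names and thresholds below are illustrative placeholders, not taken from any particular system:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateCheck:
    """One measurable steady-state condition, e.g. 'success rate above 99.5%'."""
    metric: str
    threshold: float
    higher_is_better: bool

    def passes(self, observed: float) -> bool:
        # A "higher is better" metric must stay at or above its threshold;
        # otherwise it must stay at or below it.
        return observed >= self.threshold if self.higher_is_better else observed <= self.threshold

# Hypothesis: "If we inject X failure, the system will maintain this steady state."
hypothesis = [
    SteadyStateCheck("http_success_rate", 0.995, higher_is_better=True),
    SteadyStateCheck("p99_latency_ms", 200.0, higher_is_better=False),
]

def steady_state_holds(observations: dict[str, float]) -> bool:
    """True only if every steady-state condition passes."""
    return all(check.passes(observations[check.metric]) for check in hypothesis)
```

Writing the hypothesis as data rather than prose makes the pass/fail judgment unambiguous once the experiment runs.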

2. Vary Real-World Events

The failures you inject should mirror real failures that happen in production. AWS instances do terminate unexpectedly. Networks do partition. Disks do fill up. Third-party APIs do return 500 errors. Injecting synthetic failures that never happen in reality produces results that are interesting but not actionable.

Real-world failure categories include: hardware failures (instance termination, disk failures), network failures (latency, packet loss, DNS failures), software failures (process crashes, memory pressure, clock skew), and dependency failures (third-party API errors, database connection exhaustion).
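
The dependency-failure category can be exercised directly in application code with a wrapper that fails a configurable fraction of calls. A minimal sketch - the error rate, exception type, and wrapped function are all placeholders:

```python
import random

def with_fault_injection(fn, error_rate: float, rng=random.random):
    """Wrap a dependency call so roughly `error_rate` of invocations raise,
    simulating a flaky third-party API returning errors."""
    def wrapped(*args, **kwargs):
        if rng() < error_rate:
            raise RuntimeError("injected dependency failure")
        return fn(*args, **kwargs)
    return wrapped
```

Passing a seeded random generator as `rng` makes a given experiment run reproducible, which matters when you re-run it after a fix.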

3. Run Experiments in Production

This is the principle most teams resist. Running chaos experiments in staging tells you how your staging environment behaves, not your production environment. Production has different traffic patterns, different data volumes, different third-party integrations, and different infrastructure configurations.

The goal is not recklessness - it is controlled, minimized-blast-radius experiments in the environment that actually matters. Start with a small percentage of production traffic. Build kill switches. Have rollback procedures ready. But run in production.

4. Automate Experiments to Run Continuously

A one-time chaos experiment is a snapshot. Systems change constantly: deployments add new dependencies, configuration changes alter failure modes, traffic growth changes bottlenecks. Continuous chaos means running experiments on a regular schedule so you discover new weaknesses as they are introduced, not months later during an incident.

5. Minimize Blast Radius

Chaos engineering is not about causing outages. It is about finding weaknesses in a controlled way. Start small: one instance in one availability zone, affecting one percent of traffic. Increase scope gradually as you build confidence in your system and your tooling.

A well-run chaos experiment should be invisible to users. If it causes a real outage, you have found a critical weakness - but you have also disrupted your users unnecessarily, which undermines organizational trust in the practice.
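
One common way to bound blast radius deterministically is to hash each candidate target into a bucket and include only targets below a percentage cutoff. A sketch, where the salt and bucket count are arbitrary illustrative choices:

```python
import hashlib

def in_blast_radius(target_id: str, percent: float, salt: str = "experiment-001") -> bool:
    """Deterministically select ~`percent` (0-100) of targets for the experiment.
    The same target always gets the same answer for a given salt."""
    digest = hashlib.sha256(f"{salt}:{target_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000  # 10,000 buckets gives 0.01% resolution
    return bucket < percent * 100
```

Because selection is a pure function of the target and the salt, you can widen scope by raising `percent` without reshuffling which targets were already affected.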

Chaos Engineering vs Traditional Testing

Teams often ask how chaos engineering relates to their existing testing practices. The answer is that they are complementary, not competing.

| Dimension | Unit/Integration Tests | Load Testing | Chaos Engineering |
| --- | --- | --- | --- |
| What it tests | Code logic and contracts | System behavior under volume | System behavior under failure |
| When it runs | Every commit | Pre-release, periodic | Continuously in production |
| What it finds | Logic bugs, regressions | Throughput limits, bottlenecks | Resilience gaps, hidden dependencies |
| Environment | Local, CI | Staging, pre-prod | Production (ideally) |
| Requires system running | No | Yes | Yes |
| Measures | Pass/fail | Latency, throughput | Steady state maintenance |

Traditional testing verifies that your system works correctly when everything is functioning as designed. Chaos engineering verifies that your system continues to function when things go wrong. Both are necessary for production-ready services.

How to Run Your First Chaos Experiment

Running a chaos experiment is a structured process, not a random act of destruction. Follow these seven steps for your first experiment.

Step 1: Choose a System to Test

Start with a service that is important enough to matter but not so critical that an unexpected failure would be catastrophic. A good first target is a stateless API service with horizontal scaling and a clear set of downstream dependencies.

Avoid starting with: databases, payment processing, authentication services, or any service where failure has immediate customer-facing consequences beyond degraded performance.

Step 2: Define Steady State

Identify two to four metrics that indicate the service is healthy. Pull these from your existing monitoring:

  • HTTP success rate (non-5xx responses): target above 99.5%
  • p99 request latency: target below 500ms
  • Queue processing rate: target above X messages/minute
  • Business metric: target above Y transactions/minute

These must be measurable in real time from your monitoring system. If you cannot measure steady state, you cannot run a chaos experiment.
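
A sketch of what "measurable in real time" means in practice: a small function that evaluates each steady-state metric through a pluggable query function, for example a thin wrapper around your monitoring system's HTTP API. The PromQL strings and metric names below are illustrative assumptions; substitute whatever your system actually exports:

```python
def measure_steady_state(query_fn) -> dict[str, float]:
    """Evaluate steady-state metrics right now.

    `query_fn` maps a query string to a float, e.g. by calling a
    Prometheus-style instant-query endpoint and extracting the value.
    """
    queries = {
        "http_success_rate": 'sum(rate(http_requests_total{code!~"5.."}[5m]))'
                             ' / sum(rate(http_requests_total[5m]))',
        "p99_latency_ms": 'histogram_quantile(0.99, '
                          'sum(rate(http_request_duration_ms_bucket[5m])) by (le))',
    }
    return {name: float(query_fn(q)) for name, q in queries.items()}
```

Keeping the query layer pluggable also lets you stub it out and test your experiment tooling without a live monitoring stack.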

Step 3: Form a Hypothesis

Write it down explicitly: “If we terminate one of the three instances running the payment-api service, the HTTP success rate will remain above 99.5% and p99 latency will remain below 500ms, because the load balancer will route traffic to the remaining two instances within 30 seconds.”

The hypothesis forces you to think through your assumptions: Does the load balancer health-check interval allow fast enough failover? Are the remaining instances sized to handle the additional load? Are there any sticky sessions that would strand requests?

Step 4: Plan the Experiment

Define:

  • Injection method: How you will inject the failure (terminate instance, inject network latency, kill a process)
  • Scope: Which specific resources (instance IDs, pod selectors, service names)
  • Duration: How long the failure condition will persist
  • Monitoring: Which dashboards you will watch
  • Stop conditions: What observable state will cause you to abort
  • Rollback: How you will restore normal state if needed
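
The plan above can be captured as a structured config with a mechanical abort check, so the stop conditions are evaluated the same way every time rather than eyeballed under pressure. All identifiers and thresholds here are placeholders:

```python
plan = {
    "injection": "terminate-instance",       # hypothetical injection method
    "scope": ["i-0abc123def456"],            # placeholder instance id
    "duration_seconds": 300,
    "dashboards": ["payment-api-overview"],  # placeholder dashboard name
    "stop_conditions": {
        "http_success_rate": {"min": 0.99},
        "p99_latency_ms": {"max": 1000.0},
    },
    "rollback": "relaunch instance via autoscaling group",
}

def should_abort(plan: dict, observations: dict) -> bool:
    """True if any stop condition is violated by the current observations."""
    for metric, limits in plan["stop_conditions"].items():
        value = observations.get(metric)
        if value is None:
            # Losing visibility into a stop metric is itself a stop condition.
            return True
        if "min" in limits and value < limits["min"]:
            return True
        if "max" in limits and value > limits["max"]:
            return True
    return False
```

Wiring `should_abort` into an automated kill switch is what turns a written plan into a safe experiment.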

Step 5: Inject the Failure

Execute the injection and observe. Watch your steady-state metrics, not just the system under direct attack. Often the most interesting findings are in systems you did not expect to be affected.

Keep a time-stamped log of what you observe during the experiment. Note when metrics deviate from baseline, by how much, and for how long.
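
A minimal sketch of such a log: each reading is recorded with a timestamp and a flag for whether it deviates from baseline beyond a tolerance (the 5% tolerance is an arbitrary illustrative default):

```python
import time

def record(log: list, metric: str, value: float, baseline: float,
           tolerance: float = 0.05, clock=time.time) -> bool:
    """Append a timestamped observation; flag deviations beyond
    ±tolerance (as a fraction of baseline)."""
    deviated = abs(value - baseline) > tolerance * abs(baseline)
    log.append({"t": clock(), "metric": metric, "value": value,
                "baseline": baseline, "deviated": deviated})
    return deviated
```

After the experiment, the `deviated` entries give you the "when, by how much, and for how long" answers directly.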

Step 6: Analyze Results

Did the system maintain steady state? If yes, your hypothesis was confirmed - the system is resilient to this specific failure. Document the result and move to a more severe failure mode.

If steady state was violated, you have found a weakness. Analyze the failure cascade: What failed first? What did it affect? Was the recovery automatic or manual? How long did degradation last?

Step 7: Fix and Repeat

Chaos engineering without remediation is theater. Each weakness you find should generate a concrete engineering task: add circuit breakers, increase replica counts, add retry logic, implement graceful degradation, improve health checks.

After fixing, re-run the same experiment to confirm the fix works.
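
Of those remediations, the circuit breaker is the one teams most often build by hand. A minimal in-process sketch - thresholds and timeout are illustrative, and production code would typically use a maintained library rather than this:

```python
import time

class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; fail fast while
    open; allow a trial call after `reset_timeout` seconds (half-open)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: permit one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Re-running the original dependency-failure experiment against a service wrapped this way is exactly the "fix and repeat" loop: the breaker should convert a cascading failure into fast, bounded degradation.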

Tools Comparison

The chaos engineering ecosystem has matured significantly. Here are the primary tools used by engineering teams today.

| Tool | Target Environment | Language | Open Source | Managed | Best For |
| --- | --- | --- | --- | --- | --- |
| LitmusChaos | Kubernetes | Go | Yes | Via Harness | K8s-native teams |
| Chaos Mesh | Kubernetes | Go | Yes | No | K8s with CRD-based workflows |
| Gremlin | Any | SaaS | No | Yes | Enterprise, multi-cloud |
| AWS FIS | AWS | N/A | No | Yes (AWS) | AWS-native infrastructure |
| Chaos Toolkit | Any | Python | Yes | No | API-driven automation |
| Pumba | Docker | Go | Yes | No | Docker/local environments |

LitmusChaos

LitmusChaos is the CNCF-incubated chaos engineering platform for Kubernetes. It provides 50+ pre-built experiments as CRDs, a workflow engine for multi-step chaos scenarios, and a web UI for experiment management. The Harness Chaos Engineering platform is built on LitmusChaos.

Best for: Teams running on Kubernetes who want a comprehensive, open-source platform with an active community.

Chaos Mesh

Chaos Mesh is a CNCF project that implements chaos experiments as Kubernetes Custom Resource Definitions. Every experiment type is a CRD, making it natural to version-control experiments alongside application code. It includes a dashboard and supports a wide range of fault types including pod failures, network chaos, stress testing, and time chaos.

Best for: Teams who prefer GitOps-native chaos workflows and want experiments defined as Kubernetes manifests.

Gremlin

Gremlin is a commercial SaaS chaos engineering platform. It provides a broad range of fault types, multi-cloud support, scenario workflows, and enterprise features including audit logging, role-based access, and reporting. The managed nature reduces operational overhead.

Best for: Enterprise teams with multi-cloud infrastructure and compliance requirements, or teams that want chaos engineering without managing the tooling.

AWS Fault Injection Simulator

AWS FIS is Amazon’s managed chaos engineering service. It integrates natively with AWS services - EC2, EKS, ECS, RDS, DynamoDB, and more - and uses IAM for access control. Experiments are defined as JSON templates and can be triggered from CI/CD pipelines.

Best for: Teams running primarily on AWS who want tight integration with their existing AWS tooling.

Chaos Engineering Maturity Levels

Most organizations start chaos engineering at a low maturity level and evolve over months or years. Understanding where you are helps you set realistic expectations.

Level 0 - No Chaos: All testing is pre-production. Production failures are discovered by users.

Level 1 - Manual Experiments: Teams run ad-hoc chaos experiments manually, typically after incidents to verify fixes. No automation, no continuous execution.

Level 2 - Scheduled Experiments: Experiments run on a schedule (weekly, monthly). Steady state is defined. Results are documented.

Level 3 - CI/CD Integration: Chaos experiments run automatically on deployment or on a continuous schedule. Failures block deployments.

Level 4 - Continuous Chaos: Experiments run continuously in production. Results feed into SLO tracking. Engineering culture treats chaos as standard practice, not special events.

Most startups should target Level 2 within three months of starting chaos engineering, and Level 3 within six months.

When to Hire a Chaos Engineering Consultant

Internal chaos engineering programs often stall for predictable reasons: a lack of expertise to design effective experiments, organizational resistance from teams who fear their systems will be exposed, or tooling complexity that consumes more time than the experiments themselves.

A specialist can accelerate your program significantly: designing a first-experiment portfolio based on your specific architecture, running facilitated GameDays that build organizational confidence, setting up automated tooling that runs without ongoing maintenance, and training your SRE or platform engineering team on the discipline.

If your team has experienced three or more production incidents in the past six months that were “unexpected” failures in systems that passed testing, it is time to talk to a chaos engineering specialist.

Ready to start? Our Resilience Assessment is a structured engagement that identifies your highest-impact resilience gaps and delivers a prioritized chaos experiment roadmap in two weeks.

Know Your Blast Radius

Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.

Talk to an Expert