March 12, 2026 · 9 min read · stresstest.qa

Steady-State Hypothesis: The Most Important Step in Chaos Engineering

Learn why defining steady state before chaos experiments is critical - with examples for monolith, microservices, and event-driven architectures.

Most chaos engineering programs fail quietly. Teams run experiments, observe interesting failures, write up findings - and then repeat the same experiments six months later because nothing has fundamentally changed. The missing piece is almost always the same: they never defined steady state.

The steady-state hypothesis is the most important concept in the Principles of Chaos Engineering, and it is the step most commonly skipped by teams eager to start injecting failures. This guide explains what steady state is, why it matters more than any other step in the process, and how to define it correctly for three common architecture types.

What Is Steady State?

Steady state is the normal, measurable behavior of a system when everything is functioning correctly. It is not “the system is up” - it is a specific set of quantifiable outputs that indicate the system is providing value to users at an acceptable level.

In the Principles of Chaos Engineering, the steady-state hypothesis takes the form:

“We hypothesize that if we inject [specific failure], the system will maintain [specific measurable behavior].”

The second half of that sentence - “specific measurable behavior” - is the steady state. It must be:

  1. Measurable in real time. You must be able to observe it from your monitoring system during the experiment, not calculate it afterward.
  2. Quantitative. “The system is healthy” is not steady state. “HTTP success rate above 99.5%” is steady state.
  3. User-relevant. Steady state should reflect the system’s behavior from the user’s perspective, not internal technical metrics that may not correlate with user experience.
  4. Achievable. The steady-state threshold should reflect normal operating conditions, not aspirational performance goals.
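One way to keep definitions honest against these four requirements is to write each steady-state metric as a small, checkable record. A minimal Python sketch (the `SteadyStateMetric` name is illustrative, not from any particular chaos tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SteadyStateMetric:
    """One steady-state criterion: quantitative and measurable in real time."""
    name: str            # user-relevant metric, e.g. "HTTP success rate"
    threshold: float     # quantitative bound, taken from real historical data
    window_minutes: int  # trailing window the metric is measured over
    higher_is_better: bool = True

    def holds(self, observed: float) -> bool:
        """True if the observed value satisfies the steady-state threshold."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold

# "HTTP success rate above 99.5%" expressed as a checkable definition
success_rate = SteadyStateMetric("HTTP success rate (%)", 99.5, window_minutes=15)
print(success_rate.holds(99.7))  # satisfied
print(success_rate.holds(97.8))  # violated
```

Writing the definition as data rather than prose also makes the “write it down explicitly” step later in this guide trivially reviewable.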

Why Teams Skip It

There are three common reasons teams skip steady-state definition.

Reason 1: It feels obvious. “Of course we know what normal looks like - the system is up and handling traffic.” But when a metric spikes during an experiment, teams often debate whether what they are seeing is abnormal. Without a pre-defined baseline, every observation is subject to interpretation.

Reason 2: It requires work before the fun part. Defining steady state means pulling data from your monitoring system, establishing baselines, and writing down specific numbers. It is slower and less exciting than running chaos experiments.

Reason 3: Teams do not have the metrics. This is the most important reason, and it is a gift. If you try to define steady state and discover you do not have the metrics to measure it, you have identified a more fundamental problem than any chaos experiment could reveal: your system is not observable enough to know whether it is healthy.

Why It Is the Most Important Step

Without steady state, your chaos experiments produce anecdotes instead of findings.

Consider two teams running the same experiment - terminating one instance of their API service:

Team A (no steady state defined): Terminates instance. Watches the instance list. Instance disappears and a new one comes up. Team observes no obvious errors. Concludes: “The system handled it fine.”

Team B (steady state defined: HTTP success rate above 99.5%, p99 latency below 200ms): Terminates instance. Watches their dashboard. During the 45-second period before a new instance registers as healthy, p99 latency spikes to 1.2 seconds. HTTP success rate drops to 97.8% - below the 99.5% threshold. Conclusion: “Our load balancer health check interval is too long. We need to reduce it from 60 seconds to 10 seconds.”

Same experiment. Completely different outcomes. The difference is steady-state definition.

Steady state transforms chaos engineering from observation into verification.

Four Steps to Define Steady State

Step 1: Start With User-Facing Metrics

The most important steady-state metrics are the ones that directly measure the user experience:

  • HTTP success rate: Percentage of requests that return non-5xx status codes
  • Request latency: p50, p95, p99 latency at the load balancer or API gateway level
  • Business throughput: Orders processed per minute, signups per hour, payments completed per minute
  • Availability: Is the service reachable at all?

These metrics are user-relevant. A database query that takes 10ms instead of 1ms is only meaningful if it contributes to degraded user experience.
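As a concrete illustration, the first two metrics can be computed directly from raw request samples. A minimal Python sketch with synthetic data and nearest-rank percentiles (a real setup would query the load balancer or gateway metrics instead):

```python
import math

def success_rate(status_codes):
    """Percentage of requests that did not return a 5xx status code."""
    ok = sum(1 for s in status_codes if s < 500)
    return 100.0 * ok / len(status_codes)

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a list of samples."""
    ranked = sorted(values)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

codes = [200] * 995 + [502] * 5          # 1000 requests, 5 server errors
latencies_ms = [80] * 950 + [400] * 50   # mostly fast, with a slow tail
print(success_rate(codes))               # 99.5
print(percentile(latencies_ms, 99))      # 400 (the slow tail dominates p99)
print(percentile(latencies_ms, 50))      # 80  (the median hides the tail)
```

Note how p50 and p99 disagree on the same data: this is why the latency bullet above lists multiple percentiles rather than an average.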

Step 2: Pull Historical Data

Log into your monitoring system and pull the last 30 days of data for each metric you have identified. Look for:

  • Normal range: What does the metric look like on a typical Tuesday afternoon?
  • Normal variation: What is the standard deviation? What is the difference between weekday and weekend behavior?
  • Existing anomalies: Are there spikes or drops that you know correspond to deployments, traffic events, or incidents?

From this data, set your steady-state threshold at a level that represents “clearly healthy, not stressed.” For most services, this is approximately the 10th percentile of recent values - the system performing at this level is clearly operating in a healthy state.
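The percentile step is mechanical once the data is pulled. A sketch using Python's `statistics` module, with synthetic samples standing in for 30 days of monitoring data:

```python
import statistics

# Hypothetical hourly HTTP success-rate samples over ~30 days. Real data would
# come from your monitoring system's API; this is synthetic for illustration.
samples = [99.9] * 600 + [99.7] * 100 + [99.4] * 20

# "Clearly healthy, not stressed": roughly the 10th percentile of recent values.
deciles = statistics.quantiles(samples, n=10)  # nine cut points
threshold = deciles[0]                         # the 10th percentile
print(f"steady-state threshold: success rate >= {threshold:.1f}%")
```

With this data the threshold lands at 99.7% rather than the 99.9% the system achieves most of the time, which is exactly the point: an experiment should not “fail” because the system is merely having an ordinary day.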

Step 3: Write It Down Explicitly

Document your steady-state definition in a format that all participants can reference during an experiment:

System: payment-api
Experiment: pod termination (30% of pods)
Steady State:
  - HTTP success rate >= 99.0% (15-min trailing window)
  - p99 request latency <= 300ms (15-min trailing window)
  - Orders processed per minute >= 150 (5-min trailing window)

Stop Condition: If HTTP success rate drops below 98.0% OR
                p99 latency exceeds 1000ms, abort immediately.

The stop condition is separate from steady state. Steady state is the threshold you are testing. The stop condition is a more severe threshold that triggers immediate experiment termination.
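The two-tier distinction can be expressed as a single check that runs on each observation window. A Python sketch using the payment-api thresholds above (the function and metric names are illustrative):

```python
def evaluate(metrics):
    """Classify an observation window: 'ok', 'steady-state violated', or 'abort'.

    `metrics` holds current trailing-window values. Thresholds mirror the
    payment-api example; keys are illustrative names, not a real API.
    """
    # Stop condition: more severe bounds that terminate the experiment at once.
    if metrics["success_rate"] < 98.0 or metrics["p99_ms"] > 1000:
        return "abort"
    # Steady state: the hypothesis actually under test.
    steady = (metrics["success_rate"] >= 99.0
              and metrics["p99_ms"] <= 300
              and metrics["orders_per_min"] >= 150)
    return "ok" if steady else "steady-state violated"

print(evaluate({"success_rate": 99.6, "p99_ms": 180, "orders_per_min": 210}))  # ok
print(evaluate({"success_rate": 98.4, "p99_ms": 450, "orders_per_min": 160}))  # steady-state violated
print(evaluate({"success_rate": 97.2, "p99_ms": 900, "orders_per_min": 90}))   # abort
```

The middle case is the interesting one: the experiment continues (no abort), but the hypothesis is already falsified, which is a finding worth writing down.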

Step 4: Verify Measurement Before Experimenting

Before running the experiment, confirm that you can see all of your steady-state metrics on a dashboard that every participant can access. Do a dry run: “Which panel shows p99 latency? Where is the HTTP success rate? How do we know whether the business metric is in range?”

This verification step takes five minutes and prevents the common scenario where teams discover during an experiment that they cannot find the relevant metrics.

Steady State by Architecture Type

Different architectures produce different steady-state metrics. Here are examples for three common patterns.

Monolithic Application

A monolith running as a small number of instances typically has simple, clear steady-state metrics:

| Metric | Measurement Point | Threshold |
|---|---|---|
| HTTP success rate | Load balancer | >= 99.5% |
| p99 latency | Load balancer | <= 500ms |
| Instance health | Load balancer | >= 2 healthy instances |
| Database connection pool | APM | <= 80% utilized |
| Error rate in logs | Log aggregation | < 0.1% of requests |

For a monolith, the most important chaos experiments involve instance termination and database failures. The steady state should confirm that the system remains available with N-1 instances and that database failover does not cause prolonged unavailability.
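The N-1 claim is a back-of-envelope capacity check you can do before terminating anything. A sketch in Python, with hypothetical instance counts and per-instance throughput:

```python
def survives_n_minus_1(instances, per_instance_rps, peak_rps):
    """Can the remaining fleet still carry peak traffic after losing one instance?"""
    return (instances - 1) * per_instance_rps >= peak_rps

# Hypothetical monolith: each instance comfortably serves 400 req/s.
print(survives_n_minus_1(3, 400, peak_rps=700))  # True: 2 * 400 >= 700
print(survives_n_minus_1(2, 400, peak_rps=700))  # False: 1 * 400 < 700
```

If the arithmetic already says False, the chaos experiment will only confirm a known capacity gap; fix the headroom first, then verify it with the experiment.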

Microservices Architecture

Microservices steady state is more complex because each service has its own behavior, and the relationships between services create emergent failure modes.

Define steady state at two levels:

System level (user-facing):

  • End-to-end transaction success rate >= 99.0%
  • End-to-end p99 latency <= 1000ms
  • Business throughput >= baseline transactions/minute

Service level (for targeted experiments):

| Service | Metric | Threshold |
|---|---|---|
| API Gateway | Request success rate | >= 99.5% |
| Auth Service | Token issuance rate | >= 500/min |
| Order Service | Order creation success rate | >= 99.0% |
| Inventory Service | Stock check latency | p99 <= 50ms |
| Notification Service | Message delivery rate | >= 95% (async, degraded ok) |

The key insight for microservices is that not all services are equal. A 5-minute outage of the notification service may be acceptable (notifications are async and can queue). A 30-second degradation of the auth service is not acceptable (users cannot complete any transaction). Your steady-state thresholds should reflect this hierarchy.

Event-Driven Architecture

Event-driven systems require steady-state metrics that go beyond HTTP latency. The primary value flow happens through message queues, and the steady state must capture queue health:

| Metric | Measurement Point | Threshold |
|---|---|---|
| Producer success rate | Message queue | >= 99.9% |
| Consumer lag | Queue monitoring | <= 30 seconds |
| Queue depth | Queue monitoring | <= 10x normal |
| Dead letter queue rate | DLQ | < 0.1% of messages |
| Processing latency | Consumer metrics | p99 <= 5 seconds |
| End-to-end latency | Business metrics | Order to confirmation <= 30 seconds |

For event-driven systems, the most important chaos experiments involve consumer failure (stop all consumers for a queue), producer latency (slow down producers), and broker failure (restart the message broker). Each of these has distinct effects on the steady-state metrics above.
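The consumer-failure experiment has a predictable first-order effect on the lag metric that you can estimate before running it. A rough model in Python (all rates are hypothetical):

```python
def backlog_after_outage(produce_rate, outage_seconds):
    """Messages queued while all consumers for a topic are stopped."""
    return produce_rate * outage_seconds

def drain_seconds(backlog, consume_rate, produce_rate):
    """Time for restarted consumers to clear the backlog while new messages
    keep arriving; requires consume_rate > produce_rate."""
    if consume_rate <= produce_rate:
        raise ValueError("consumers can never catch up")
    return backlog / (consume_rate - produce_rate)

# Hypothetical: 200 msg/s produced, consumers down for 60s, drain at 300 msg/s.
backlog = backlog_after_outage(200, 60)   # 12000 messages queued
print(backlog)
print(drain_seconds(backlog, 300, 200))   # 120.0 seconds to recover
```

Under these assumed rates, a 60-second consumer outage takes two full minutes to drain, so the consumer-lag steady-state threshold of 30 seconds would be violated long after the injected failure ends. That recovery tail is precisely what the experiment should surface.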

Connection to SLOs and Error Budgets

Well-designed steady-state definitions should connect directly to your Service Level Objectives (SLOs). If you have defined an SLO of 99.5% availability for your checkout service, your steady-state threshold for chaos experiments should be at or above 99.5%.

This connection has several benefits:

It makes chaos experiments business-relevant. When a chaos experiment violates steady state, the interpretation is clear: this failure mode would consume error budget. The business case for fixing it is built into the SLO framework.

It normalizes the resilience conversation. SRE teams already understand error budgets and burn rates. Framing chaos findings in terms of “this failure mode would burn 12% of our monthly error budget in 30 minutes” is more compelling to engineering leadership than “the service degraded.”

It prioritizes which experiments to run. Services with tight SLOs and small error budgets should receive more chaos engineering attention than services with loose SLOs and large budgets.
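The burn-rate framing is simple arithmetic. A Python sketch with hypothetical traffic numbers, showing what a 30-minute full outage costs against a 99.5% monthly SLO:

```python
def budget_burned(slo, error_rate, incident_minutes, rps, monthly_requests):
    """Fraction of the monthly error budget consumed by one incident.

    slo: availability target, e.g. 0.995
    error_rate: fraction of requests failing during the incident (1.0 = full outage)
    """
    budget = (1 - slo) * monthly_requests            # allowed failed requests/month
    failed = error_rate * rps * incident_minutes * 60
    return failed / budget

# Hypothetical service: steady 100 req/s, so ~259M requests in a 30-day month.
monthly = 100 * 60 * 60 * 24 * 30
burn = budget_burned(0.995, error_rate=1.0, incident_minutes=30, rps=100,
                     monthly_requests=monthly)
print(f"{burn:.1%} of monthly error budget")  # ~13.9% from one 30-min outage
```

Numbers like these turn a chaos finding into a prioritization argument leadership can act on.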

If your team does not yet have SLOs defined, the process of defining steady state for chaos engineering is an excellent forcing function to start. The metrics you identify are exactly the metrics that should form the basis of your SLOs.

Common Steady-State Mistakes

Using internal metrics instead of user-facing metrics. “Database query latency <= 10ms” is not a good steady-state metric. “Application p99 response time <= 200ms” is better. The database metric may be important, but it is only meaningful as a contributor to the user-facing metric.

Setting thresholds too loose. “HTTP success rate >= 90%” means you would accept 10% error rates during chaos experiments. Real users would notice a 10% error rate immediately. Set thresholds that reflect what your users actually experience.

Setting thresholds too tight. If your system currently operates at 99.7% success rate and you set a steady-state threshold of 99.9%, experiments will appear to “fail” even though the system is behaving normally. Pull real historical data before setting thresholds.

Defining steady state for the wrong window. Instant metrics are noisy. A single-second spike to 99.0% success rate during an experiment is not the same as a sustained 15-minute drop to 99.0%. Always specify the measurement window in your steady-state definition: “HTTP success rate >= 99.5% measured over a 5-minute trailing window.”

The steady-state hypothesis is not bureaucracy. It is the mechanism that transforms chaos engineering from interesting destruction into systematic resilience improvement. Teams that get this right build systems that genuinely fail gracefully under real-world conditions.

Know Your Blast Radius

Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
