Kubernetes Breaks in Ways You Haven't Tested
Kubernetes adds layers of resilience abstraction that create new failure modes. We run K8s-native chaos experiments targeting pods, nodes, network policies, and StatefulSets — and fix what we find.
You might be experiencing...
Kubernetes resilience testing targets the failure modes that are unique to container orchestration: pod eviction cascades, PodDisruptionBudget enforcement gaps, StatefulSet recovery sequences, and network policy behaviour under partition. These failure modes are invisible in standard load tests and only surface under deliberate chaos or in production incidents.
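A PodDisruptionBudget enforcement gap is often as simple as a missing or mis-selecting manifest. For reference, a minimal PDB looks like this (resource name and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb            # illustrative name
spec:
  minAvailable: 2          # voluntary disruptions must leave >= 2 pods
  selector:
    matchLabels:
      app: api             # must match the workload's pod labels exactly
```

If the selector matches no pods, or a Deployment carries labels no PDB selects, the budget silently protects nothing — exactly the class of gap chaos experiments surface.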
The most common finding from K8s chaos engagements is that pod recovery takes significantly longer than expected. The Kubernetes control plane, scheduler, and kubelet introduce latency that makes pod recovery a seconds-to-minutes operation, not milliseconds. During that window, traffic is dropped, retried, or shed — and the behaviour depends on how readiness probes, the HPA, and load balancer health checks are configured. We measure the actual impact, not the theoretical one.
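One dominant term in that window is readiness probe configuration. As a rough back-of-envelope (a sketch; real latency also includes endpoint propagation and load balancer health-check intervals, which this does not model):

```python
def not_ready_window(period_seconds: int = 10, failure_threshold: int = 3) -> int:
    """Approximate worst-case seconds between a pod starting to fail its
    readiness probe and the kubelet marking it NotReady. Defaults match
    the Kubernetes probe defaults (periodSeconds=10, failureThreshold=3).
    Endpoint removal and LB propagation add further, unmodeled latency."""
    return period_seconds * failure_threshold

# With defaults, a broken pod can receive traffic for ~30s before
# it is even marked NotReady.
print(not_ready_window())
```

Tightening the probe shrinks the stale-traffic window but raises the risk of flapping under transient load, which is why the right values have to be measured, not assumed.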
StatefulSet resilience is a particular focus: databases, message queues, and caches running as StatefulSets have recovery sequences that depend on PVC reattachment, leader election, and data sync — all of which take time and all of which can fail. We validate the full recovery sequence under realistic conditions and produce a timed runbook your team can use during actual incidents.
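The pieces that gate StatefulSet recovery live in the manifest itself: `podManagementPolicy` controls recovery ordering and `volumeClaimTemplates` tie each pod to a PVC that must reattach before the container starts. A minimal sketch (image, names, and sizes are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg                              # illustrative name
spec:
  serviceName: pg
  replicas: 3
  podManagementPolicy: OrderedReady     # pods recover one at a time, in order
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:                 # each pod gets its own PVC;
    - metadata:                         # reattachment gates pod startup
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

With `OrderedReady`, a multi-pod outage recovers serially — each pod must be Running and Ready before the next starts — which is why measured recovery times compound.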
Engagement Phases
K8s Architecture Review
We review your cluster configuration, workload definitions, PodDisruptionBudgets, HPA/VPA configs, network policies, and StatefulSet storage configurations. We identify K8s-specific failure modes and design a targeted experiment backlog.
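As an illustration of one check in this phase, PDB coverage can be audited offline from exported manifests. A minimal sketch in Python (the function and data shapes are ours, not a real client API; `matchExpressions` selectors are ignored here):

```python
def deployments_without_pdb(deployments: dict[str, dict[str, str]],
                            pdb_selectors: list[dict[str, str]]) -> list[str]:
    """Flag workloads whose pod labels match no PodDisruptionBudget selector.

    deployments: workload name -> pod template labels
    pdb_selectors: each PDB's spec.selector.matchLabels
    """
    def covered(labels: dict[str, str]) -> bool:
        # A PDB covers a workload if its matchLabels are a subset
        # of the workload's pod labels.
        return any(sel.items() <= labels.items() for sel in pdb_selectors)

    return sorted(name for name, labels in deployments.items()
                  if not covered(labels))

gaps = deployments_without_pdb(
    {"api": {"app": "api"}, "worker": {"app": "worker"}},
    [{"app": "api"}],
)
print(gaps)  # the 'worker' deployment has no disruption budget
```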
Pod & Node Chaos
We run pod eviction experiments (random pod kill, OOMKill simulation, init container failure), node failure scenarios (cordon/drain, node termination), and resource pressure tests (CPU throttling, memory pressure). We measure pod recovery time, traffic impact, and HPA response.
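A pod-kill experiment of this kind can be expressed as a LitmusChaos `ChaosEngine`; a sketch against the v1alpha1 API (namespace, labels, and service account are illustrative and depend on your Litmus installation):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete          # illustrative name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=api           # target workload's pod label
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds of sustained chaos
              value: "60"
            - name: CHAOS_INTERVAL         # seconds between pod kills
              value: "10"
```

While the engine runs, we watch pod restart timing, endpoint churn, and HPA behaviour to turn "pods recover" into a measured number.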
Network & StatefulSet Chaos
We test network policy failures, DNS disruption, service mesh chaos (if applicable), and StatefulSet recovery sequences. We validate PDB enforcement, measure StatefulSet pod recovery with persistent volume reattachment, and verify that the cluster autoscaler responds correctly.
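A network partition of the kind described can be declared with a Chaos Mesh `NetworkChaos` resource; a sketch with illustrative names (field shapes per the chaos-mesh.org v1alpha1 API):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-partition           # illustrative name
  namespace: default
spec:
  action: partition            # cut traffic rather than delay it
  mode: all                    # apply to every matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: db                  # target workload's pod label
  direction: both              # drop traffic in both directions
  duration: "60s"
```

Running this against a StatefulSet-backed database exercises leader election and client failover under a real partition, not a simulated one.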
Deliverables
Before & After
| Metric | Before | After |
|---|---|---|
| Pod recovery time | 30 s (assumed) | 4.5 min (measured) |
| PDB violations | 0 known | 3 identified |
| StatefulSet recovery | Untested | Validated with timing |
Tools We Use
Frequently Asked Questions
Do you need cluster admin access?
We need cluster-admin in the target environment to install LitmusChaos and execute node-level experiments. For production clusters, we can work with a restricted role that covers pod deletion and resource annotation. We document every permission we use and remove the chaos tooling at engagement close if preferred.
What Kubernetes versions and platforms do you support?
We support Kubernetes 1.24 and above on any CNCF-conformant distribution: EKS, GKE, AKS, OpenShift, Rancher, and self-managed clusters. LitmusChaos and Chaos Mesh both have wide platform support. We flag any platform-specific limitations in the Day 1 architecture review.
We use a service mesh — does that change the engagement?
Yes, in a good way. Istio, Linkerd, and Envoy-based meshes add traffic management primitives that we can test specifically — fault injection via VirtualService, timeout configuration, and circuit breaker behaviour. Service mesh chaos is included in our experiment backlog for mesh-enabled clusters.
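As an example, Istio fault injection lets us delay a percentage of requests declaratively, without touching the workload; a sketch with illustrative host and service names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-fault            # illustrative name
spec:
  hosts:
    - api                    # in-mesh service name
  http:
    - fault:
        delay:
          percentage:
            value: 50        # delay half of all requests
          fixedDelay: 5s     # by a fixed 5 seconds
      route:
        - destination:
            host: api
```

This surfaces whether downstream timeouts, retries, and circuit breakers actually fire at the thresholds your configs claim.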
Know Your Blast Radius
Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.
Talk to an Expert