March 6, 2026 · 8 min read · stresstest.qa

AWS Fault Injection Simulator: Complete Setup Guide for EKS and EC2

Step-by-step AWS FIS tutorial - IAM setup, EC2 termination experiments, EKS chaos, RDS failover testing, and CI/CD automation.

AWS Fault Injection Simulator (FIS) is Amazon’s managed chaos engineering service. It provides a controlled way to inject failures into AWS resources - EC2 instances, EKS pods, RDS databases, DynamoDB tables, and more - using IAM-controlled experiment templates that integrate with your existing AWS tooling.

This tutorial covers everything you need to run production-grade chaos experiments with AWS FIS: IAM setup, your first EC2 experiment, EKS pod failure testing, RDS failover validation, and integrating FIS experiments into GitHub Actions.

What Is AWS FIS?

AWS FIS was launched in March 2021. It is a fully managed service - there is nothing to install or operate. You define experiments as JSON templates stored in the FIS console or via API, and FIS executes them against your AWS resources.

Pricing: FIS charges per action-minute. As of 2026, the price is $0.10 per action-minute for most resources. A 10-minute experiment with three concurrent actions costs $3.00. For typical chaos engineering programs (10-20 experiments per month at 5-15 minutes each), monthly costs are $50-200, making it cost-effective compared to maintaining your own chaos tooling.
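The cost arithmetic is simple enough to script when budgeting an experiment suite. A quick sketch, hard-coding the $0.10 rate quoted above:

```shell
# Cost = duration (minutes) x concurrent actions x $0.10 per action-minute
duration_min=10
actions=3
rate_cents=10  # $0.10 per action-minute, per the pricing above
cost_cents=$(( duration_min * actions * rate_cents ))
printf 'Estimated cost: $%d.%02d\n' $(( cost_cents / 100 )) $(( cost_cents % 100 ))
# prints: Estimated cost: $3.00
```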

Supported targets include:

  • EC2 instances (stop, terminate, reboot, CPU stress, network disruption)
  • EKS pods (pod termination, node drain)
  • ECS tasks (task stop)
  • RDS instances (failover, reboot)
  • DynamoDB (global table pause)
  • Route 53 (health check failures)
  • CloudWatch alarms (trigger alarm state)
  • Spot instances (interruption notices)

What FIS cannot do: FIS does not support arbitrary command execution inside pods, and on EC2 instances it is limited to what you can express as SSM documents. For in-process fault injection (memory leaks, CPU hogs inside a process, application-level errors), you still need tools like LitmusChaos or Chaos Mesh running inside your cluster.

IAM Setup

AWS FIS requires two IAM roles: one that grants FIS permission to act on your resources, and optionally a second for your CI/CD pipeline to trigger experiments.

FIS Execution Role

Create a role that FIS can assume. This role needs permissions to perform the actions specified in your experiments.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Attach this policy to the role (adjust resource ARNs to your specific resources for least privilege):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEC2FaultInjection",
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:TerminateInstances",
        "ec2:RebootInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/chaos-enabled": "true"
        }
      }
    },
    {
      "Sid": "AllowEKSFaultInjection",
      "Effect": "Allow",
      "Action": [
        "eks:DescribeNodegroup",
        "eks:ListNodegroups"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowRDSFaultInjection",
      "Effect": "Allow",
      "Action": [
        "rds:RebootDBInstance",
        "rds:FailoverDBCluster",
        "rds:DescribeDBInstances",
        "rds:DescribeDBClusters"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowSSMForFaultInjection",
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:GetCommandInvocation",
        "ssm:ListCommands",
        "ssm:CancelCommand"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowCloudWatchStopConditions",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarms"
      ],
      "Resource": "*"
    }
  ]
}

Important: The chaos-enabled: true tag condition on EC2 actions ensures FIS can only terminate instances you have explicitly tagged for chaos. Tag your non-production or chaos-eligible instances accordingly before running experiments.
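Assuming the two JSON documents above are saved locally (the file names here are placeholders), creating the execution role is two CLI calls:

```shell
# Create the execution role with the FIS trust policy
aws iam create-role \
  --role-name fis-execution-role \
  --assume-role-policy-document file://fis-trust-policy.json

# Attach the fault-injection permissions as an inline policy
aws iam put-role-policy \
  --role-name fis-execution-role \
  --policy-name fis-execution-permissions \
  --policy-document file://fis-permissions.json
```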

CI/CD Role

For GitHub Actions automation, create a second role with permission to start and monitor FIS experiments:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowFISExperimentManagement",
      "Effect": "Allow",
      "Action": [
        "fis:CreateExperimentTemplate",
        "fis:StartExperiment",
        "fis:StopExperiment",
        "fis:GetExperiment",
        "fis:ListExperiments",
        "fis:ListExperimentTemplates",
        "fis:GetExperimentTemplate"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowPassRoleToFIS",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::YOUR_ACCOUNT_ID:role/fis-execution-role"
    }
  ]
}
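The github-actions-fis-role used later in the workflow also needs a trust policy for GitHub's OIDC provider so the workflow can assume it without long-lived keys. A sketch - the org/repo values are placeholders, and the OIDC provider must already exist in your account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:YOUR_ORG/YOUR_REPO:*"
        }
      }
    }
  ]
}
```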

Your First EC2 Experiment: Instance Termination

This experiment terminates one EC2 instance tagged with chaos-enabled: true and verifies that your application continues to serve traffic.

Step 1: Tag Your EC2 Instances

aws ec2 create-tags \
  --resources i-0abc123def456789a \
  --tags Key=chaos-enabled,Value=true Key=Environment,Value=production

Step 2: Create the Experiment Template

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Terminate one EC2 instance to test load balancer failover",
    "targets": {
      "ec2-instances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {
          "chaos-enabled": "true",
          "Environment": "production"
        },
        "selectionMode": "COUNT(1)"
      }
    },
    "actions": {
      "terminate-instance": {
        "actionId": "aws:ec2:terminate-instances",
        "targets": {
          "Instances": "ec2-instances"
        }
      }
    },
    "stopConditions": [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:YOUR_ACCOUNT_ID:alarm/ApplicationHealthAlarm"
      }
    ],
    "roleArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/fis-execution-role",
    "tags": {
      "ExperimentType": "ec2-termination",
      "Team": "platform"
    }
  }'

The stopConditions field is critical. It references a CloudWatch alarm that monitors your application health. If the alarm enters ALARM state during the experiment, FIS will automatically stop the experiment. Always configure stop conditions for production experiments.
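If you do not already have a suitable alarm, one can be created from any health metric. A sketch using an ALB 5xx count - the load balancer dimension and thresholds are placeholders you must adjust:

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name ApplicationHealthAlarm \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/production-alb/0123456789abcdef \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching
```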

Step 3: Start the Experiment

# Note the template ID from the create-experiment-template output
TEMPLATE_ID="EXT1234567890"

EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id $TEMPLATE_ID \
  --query 'experiment.id' \
  --output text)

echo "Experiment ID: $EXPERIMENT_ID"

Step 4: Monitor the Experiment

# Poll experiment status every 5 seconds (quote the command so watch
# passes it to the shell intact - unquoted, the spaces in --query break it)
watch -n 5 "aws fis get-experiment \
  --id $EXPERIMENT_ID \
  --query 'experiment.{Status: state.status, Reason: state.reason}' \
  --output table"

Simultaneously, watch your application metrics in CloudWatch or your APM tool. The experiment is considered successful if your application maintains its steady state throughout.
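After the run, the experiment record also contains per-action state, which is useful when a template defines several actions:

```shell
# Show the state of each individual action in the experiment
aws fis get-experiment \
  --id $EXPERIMENT_ID \
  --query 'experiment.actions' \
  --output json
```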

EKS Experiments

FIS supports two primary EKS experiment types: pod termination and node termination.

Pod Termination

FIS executes EKS pod actions through the Kubernetes API, not SSM. Before the experiment can run, map the FIS execution role to a Kubernetes identity (via an aws-auth ConfigMap entry or an EKS access entry) and create a service account in the target namespace with RBAC permission to list and delete pods; the service account name is passed to the action as the kubernetesServiceAccount parameter. SSM is only required for EC2-level actions on the nodes themselves (CPU stress, network disruption), which need the SSM agent (included by default on Amazon Linux 2 and Bottlerocket AMIs) and an instance profile with SSM permissions.
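The aws:eks:pod-delete action authenticates to the cluster as a Kubernetes service account that you create ahead of time. A minimal RBAC sketch covering pod deletion (names are placeholders; check the FIS documentation for the exact rules each pod action requires, and pair this with the aws-auth or access-entry mapping for the FIS execution role):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fis-pod-delete
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fis-pod-delete
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fis-pod-delete
  namespace: production
subjects:
  - kind: ServiceAccount
    name: fis-pod-delete
    namespace: production
roleRef:
  kind: Role
  name: fis-pod-delete
  apiGroup: rbac.authorization.k8s.io
```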

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Terminate 30% of payment-service pods",
    "targets": {
      "payment-pods": {
        "resourceType": "aws:eks:pod",
        "parameters": {
          "clusterIdentifier": "arn:aws:eks:us-east-1:YOUR_ACCOUNT_ID:cluster/production-cluster",
          "namespace": "production",
          "selectorType": "labelSelector",
          "selectorValue": "app=payment-service"
        },
        "selectionMode": "PERCENT(30)"
      }
    },
    "actions": {
      "terminate-pods": {
        "actionId": "aws:eks:pod-delete",
        "parameters": {
          "kubernetesServiceAccount": "fis-pod-delete"
        },
        "targets": {
          "Pods": "payment-pods"
        }
      }
    },
    "stopConditions": [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:YOUR_ACCOUNT_ID:alarm/PaymentServiceHealth"
      }
    ],
    "roleArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/fis-execution-role"
  }'

Node Termination Experiment

FIS has no graceful drain action. The closest equivalent is aws:eks:terminate-nodegroup-instances, which terminates a percentage of the instances in a managed node group and forces Kubernetes to reschedule their pods - an unplanned node failure rather than a polite maintenance drain:

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Terminate 20% of instances in one node group to test pod rescheduling",
    "targets": {
      "eks-nodegroup": {
        "resourceType": "aws:eks:nodegroup",
        "resourceTags": {
          "chaos-enabled": "true"
        },
        "selectionMode": "COUNT(1)"
      }
    },
    "actions": {
      "terminate-nodes": {
        "actionId": "aws:eks:terminate-nodegroup-instances",
        "parameters": {
          "instanceTerminationPercentage": "20"
        },
        "targets": {
          "Nodegroups": "eks-nodegroup"
        }
      }
    },
    "stopConditions": [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:YOUR_ACCOUNT_ID:alarm/ClusterHealthAlarm"
      }
    ],
    "roleArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/fis-execution-role"
  }'
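While the experiment runs, you can watch the rescheduling happen from a second terminal (namespace and label are placeholders matching your workload):

```shell
# Watch pods leave the terminated node and come back Ready elsewhere
kubectl get pods -n production -l app=payment-service -o wide --watch
```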

RDS Failover Testing

For Aurora clusters, failover testing verifies that your application reconnects successfully after the writer fails over to a replica in another Availability Zone. (For non-Aurora Multi-AZ DB instances, use the aws:rds:reboot-db-instances action with its forceFailover parameter set to true instead.)

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Trigger RDS Multi-AZ failover",
    "targets": {
      "rds-cluster": {
        "resourceType": "aws:rds:cluster",
        "resourceArns": [
          "arn:aws:rds:us-east-1:YOUR_ACCOUNT_ID:cluster:production-aurora-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "failover-cluster": {
        "actionId": "aws:rds:failover-db-cluster",
        "targets": {
          "Clusters": "rds-cluster"
        }
      }
    },
    "stopConditions": [
      {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:YOUR_ACCOUNT_ID:alarm/DatabaseHealthAlarm"
      }
    ],
    "roleArn": "arn:aws:iam::YOUR_ACCOUNT_ID:role/fis-execution-role"
  }'

Key metrics to watch during an RDS failover:

  • Application error rate (connection refused errors)
  • Time to reconnection (how long until the application successfully connects to the new primary)
  • Queue backup (if the application processes messages from a queue, how many accumulate during the failover window)
  • Data consistency (any writes attempted during failover - were they lost, queued, or retried?)

A well-configured application with connection pooling and retry logic should recover from an Aurora failover in 30-60 seconds. If recovery takes longer, investigate connection pool configuration and retry backoff settings.
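A crude way to measure time-to-reconnection is to poll the cluster endpoint in a loop and log the gap. A sketch assuming PostgreSQL-compatible Aurora and the pg_isready client (the endpoint is a placeholder):

```shell
#!/usr/bin/env bash
# Poll the writer endpoint once per second; print when it goes down
# and how long it takes to come back.
ENDPOINT="production-aurora-cluster.cluster-example.us-east-1.rds.amazonaws.com"
down_at=""
while true; do
  if pg_isready -h "$ENDPOINT" -p 5432 -t 1 >/dev/null 2>&1; then
    if [ -n "$down_at" ]; then
      echo "Recovered after $(( $(date +%s) - down_at ))s"
      down_at=""
    fi
  elif [ -z "$down_at" ]; then
    down_at=$(date +%s)
    echo "Endpoint down at $(date)"
  fi
  sleep 1
done
```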

GitHub Actions Integration

Running FIS experiments automatically in CI/CD enables continuous chaos testing - every deployment can trigger a suite of chaos experiments to verify the new code version maintains resilience.

name: Chaos Resilience Tests

on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        default: 'staging'
  schedule:
    # Run every Tuesday and Thursday at 2pm UTC
    - cron: '0 14 * * 2,4'

permissions:
  id-token: write
  contents: read

jobs:
  run-chaos-experiments:
    runs-on: ubuntu-latest
    environment: ${{ github.event.inputs.environment || 'staging' }}

    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-actions-fis-role
          aws-region: us-east-1

      - name: Run EC2 instance termination experiment
        id: ec2-experiment
        run: |
          EXPERIMENT_ID=$(aws fis start-experiment \
            --experiment-template-id ${{ vars.FIS_EC2_TEMPLATE_ID }} \
            --query 'experiment.id' \
            --output text)
          echo "experiment_id=$EXPERIMENT_ID" >> $GITHUB_OUTPUT
          echo "Started experiment: $EXPERIMENT_ID"

      - name: Wait for experiment to complete
        run: |
          EXPERIMENT_ID=${{ steps.ec2-experiment.outputs.experiment_id }}
          MAX_WAIT=600  # 10 minutes
          ELAPSED=0

          while [ $ELAPSED -lt $MAX_WAIT ]; do
            STATUS=$(aws fis get-experiment \
              --id $EXPERIMENT_ID \
              --query 'experiment.state.status' \
              --output text)

            echo "Experiment status: $STATUS (${ELAPSED}s elapsed)"

            if [ "$STATUS" = "completed" ]; then
              echo "Experiment completed successfully"
              exit 0
            elif [ "$STATUS" = "failed" ] || [ "$STATUS" = "stopped" ]; then
              echo "Experiment $STATUS - checking reason"
              aws fis get-experiment \
                --id $EXPERIMENT_ID \
                --query 'experiment.state.reason' \
                --output text
              exit 1
            fi

            sleep 15
            ELAPSED=$((ELAPSED + 15))
          done

          echo "Experiment timed out"
          aws fis stop-experiment --id $EXPERIMENT_ID
          exit 1

      - name: Verify steady state post-experiment
        run: |
          # Check CloudWatch alarm state 5 minutes after experiment
          sleep 300
          ALARM_STATE=$(aws cloudwatch describe-alarms \
            --alarm-names "ApplicationHealthAlarm" \
            --query 'MetricAlarms[0].StateValue' \
            --output text)

          if [ "$ALARM_STATE" != "OK" ]; then
            echo "Application health alarm is $ALARM_STATE after experiment"
            exit 1
          fi
          echo "Application health confirmed OK after experiment"

      - name: Notify on failure
        if: failure()
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          text: "Chaos experiment failed - system resilience regression detected"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

AWS FIS vs Open-Source Chaos Tools

| Criterion | AWS FIS | LitmusChaos | Chaos Mesh |
|---|---|---|---|
| AWS service integration | Native, deep | Via experiment scripts | Limited |
| Kubernetes support | Via Kubernetes API / SSM | Native CRDs | Native CRDs |
| In-process fault injection | No | Yes | Yes |
| Operational overhead | Zero (managed) | Moderate | Moderate |
| Cost | $0.10/action-minute | Free (open source) | Free (open source) |
| IAM integration | Native | Kubernetes RBAC | Kubernetes RBAC |
| Experiment versioning | JSON templates | CRDs in git | CRDs in git |
| Multi-region support | Yes | Requires multi-cluster setup | Requires multi-cluster setup |
| Stop conditions | CloudWatch alarms | Probes | Probes |

The right choice depends on your infrastructure. For AWS-native teams with primarily EC2, EKS, and RDS workloads, FIS is the right primary tool. For teams that need in-process fault injection (memory leaks, CPU hogs inside application processes, application-level error injection), complement FIS with LitmusChaos or Chaos Mesh running in your EKS cluster.

Many mature chaos engineering programs use both: FIS for infrastructure-level failures (instance termination, network disruption, service failover) and a K8s-native tool for application-level failures.

Our team helps AWS engineering teams design and execute chaos experiments that produce real resilience improvements. If you are running on AWS and want to build a systematic chaos engineering program, get in touch.

Know Your Blast Radius

Book a free 30-minute resilience scope call with our chaos engineers. We review your architecture, identify your highest-risk failure modes, and recommend the experiments that will give you the most signal.

Talk to an Expert