Corralling chaos with AWS Fault Injection Service and CloudWatch
Senior Azure Architect Amy Coughlin explores how to use CloudWatch and AWS Fault Injection Service to control chaos experiments.
Nov 14, 2024 • 4 Minute Read
So, you’ve learned all about chaos engineering and you’re thinking of ways to unleash a storm upon your systems to test their resilience. But before you do, how do you make sure you can keep that chaos from spilling over and doing real damage? In this tutorial, I’ll share how you can use CloudWatch in combination with the AWS Fault Injection Service (FIS) to automatically corral chaos experiments that have bucked out of control.
The benefits of using FIS and CloudWatch for controlling chaos engineering experiments
Most chaos engineering platforms provide a way to manually rein in an experiment and roll back to a healthier state. However, FIS goes the extra mile by allowing you to set up conditions under which an experiment will automatically stop, before going over the proverbial cliff.
This is accomplished through an integration of CloudWatch alarms with FIS experiments, enabling an experiment to have one or more stop conditions. It’s like having a trusty ranch hand ready to lasso any runaway chaos. If a stop condition is triggered during an experiment, AWS FIS stops the experiment.
How to create your experiment’s stop conditions
To create a stop condition, you first need to identify what you consider a healthy working state for your service, and then decide on your threshold for a less-than-optimal state that is still acceptable under chaotic conditions. Keep in mind that the injected chaos is expected to degrade performance, so you won't usually set the threshold at the fully healthy state. That threshold becomes your "steady state," and it's what you use to create a CloudWatch alarm. The alarm is designed to stop an experiment if the state of your application or service breaches that minimum operational threshold.
For example, you might want to set up an alarm based on high CPU utilization on an EC2 instance and then stop the experiment when the average CPU utilization exceeds 85 percent. Depending on how you have CloudWatch monitoring enabled, you can also control the period over which to measure the average utilization, generally ranging from one to ten minutes.
Setting Up a CPUUtilization Alarm using the AWS CLI
The CLI code below creates the alarm. Before you can set it up, though, you will need to create an Amazon Simple Notification Service (SNS) topic (see Setup SNS). In the example, the pre-created SNS topic is referenced in the next-to-last line, and the quoted values are the ones you'd configure for your own environment.
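Here's a minimal sketch of that put-metric-alarm command, assuming an EC2 instance target and a pre-created SNS topic. The alarm name, instance ID, region, account ID, and topic ARN shown are placeholders, not values from a real environment:

```bash
# Create a CloudWatch alarm that fires when average CPU on one EC2 instance
# exceeds 85% over a five-minute period. Placeholder values throughout.
aws cloudwatch put-metric-alarm \
  --alarm-name "fis-stop-high-cpu" \
  --alarm-description "Stop the FIS experiment when average CPU exceeds 85 percent" \
  --namespace "AWS/EC2" \
  --metric-name "CPUUtilization" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789012:fis-experiment-alerts" \
  --unit Percent
```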
Also note that the period for this type of alarm defaults to 300 seconds, but if you have detailed monitoring configured, you can set it to 60 seconds.
This alarm is then used as part of a stop condition on an experiment, as sketched below. So, if the herd is kicking and stamping about instead of heeding the whistles of the cowfolk, that's OK for now. But if it develops into an all-out stampede, the stop condition can kick in and restore peace on the open plain.
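To wire the alarm in, you reference its ARN in the stopConditions of your FIS experiment template. A minimal sketch, assuming the template lives in a local file named cpu-stress-template.json and reusing the placeholder account and alarm name from above:

```bash
# Inside cpu-stress-template.json, the stop condition points at the CloudWatch alarm ARN:
#
#   "stopConditions": [
#     {
#       "source": "aws:cloudwatch:alarm",
#       "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:fis-stop-high-cpu"
#     }
#   ]
#
# Create the experiment template from that file; any experiment started from it
# is stopped automatically by FIS if the alarm goes into the ALARM state.
aws fis create-experiment-template --cli-input-json file://cpu-stress-template.json
```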
To be fair, you can assemble something similar in Azure with Azure Monitor alerts and Azure Automation, but it just isn’t as tightly integrated with the Azure Chaos Studio service.
Learn more about chaos engineering (before letting the bulls run wild)
If you’re interested in learning more about how to conduct either Azure Chaos Studio experiments or AWS FIS experiments (or both!), check out these learning opportunities on Pluralsight:
- Azure Chaos Engineering Essentials Path | Pluralsight
- Hands-On Chaos Engineering with AWS Fault Injection Simulator
Also, if you’re interested in how you might use Azure for chaos engineering, check out this blog post: “Chaos engineering: Creating a perfect storm with Azure Chaos Studio.”