Chaos engineering: Perfect storms inside Azure Chaos Studio
A tutorial on testing the resiliency of your systems with controlled chaos in Azure Chaos Studio, by Senior Azure Architect Amy Coughlin.
Nov 12, 2024 • 4 Minute Read
So, you’ve learned all about chaos engineering and now you’re ready to unleash a storm upon your systems to see how they hold up (and plug any leaks you find). This tutorial will show you how to do that using Azure Chaos Studio, a cloud service designed to conduct exactly these sorts of experiments.
How Azure Chaos Studio experiments are structured
Azure Chaos Studio is a service on which you build experiments. An experiment applies one or more actions (that is, it injects faults), which can be executed in sequence or in parallel. Each experiment runs against one or more targets, which could be Azure services, such as an Azure SQL Database, or components within a broader application. Think of the targets as ships, bobbing away at anchor, but deliberately placed in harm’s way to test their resilience.
Actions are not free-floating within an experiment definition. They are actually wrapped by steps, like this:
Steps are executed sequentially, which is why I’ve numbered them with step 1, step 2, and on from there. It’s like navigating through a series of waves, one after the other.
But there is yet another layer between steps and actions, called branches:
A step can have one or more branches. Branches do not depend on one another, and they execute in parallel, in contrast to steps, which run sequentially. Notice that each branch can have one or more actions. As the experiment designer, you control sequential versus parallel fault injection by nesting actions within branches and branches within steps.
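To make the nesting concrete, here is a minimal sketch of the steps, branches, and actions hierarchy as plain Python data. The fault names (`fault-a` and friends) are placeholders I made up for illustration, not entries from the fault library, and the sketch assumes that actions inside a single branch run in the order listed.

```python
def action(name, duration):
    """A placeholder action; real experiments reference faults from the fault library."""
    return {"name": name, "duration": duration}

# Steps run one after another; branches inside a step run side by side.
experiment = {
    "steps": [
        {
            "name": "Step 1",
            "branches": [
                {"name": "Branch 1", "actions": [action("fault-a", "PT5M")]},
                {"name": "Branch 2", "actions": [action("fault-b", "PT5M"),
                                                 action("fault-c", "PT2M")]},
            ],
        },
        {
            "name": "Step 2",
            "branches": [
                {"name": "Branch 1", "actions": [action("fault-d", "PT3M")]},
            ],
        },
    ]
}

def describe(exp):
    """Return one line per step and per branch, showing the nesting."""
    lines = []
    for step in exp["steps"]:
        lines.append(f"{step['name']} (starts after the previous step finishes)")
        for branch in step["branches"]:
            names = " -> ".join(a["name"] for a in branch["actions"])
            lines.append(f"  {branch['name']} (parallel with sibling branches): {names}")
    return lines

outline = describe(experiment)
print("\n".join(outline))
```

Reading the printed outline top to bottom gives the sequential order of the steps, while the indented branch lines under one step all start together.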
There are at least two other ways that Azure Chaos Studio enables more “curated” experiments. The first is through the use of delays, a particular type of action that acts as an orchestrator for more nuanced control. If you are not too tired of my windy puns yet, think of them as the calm before the storm.
Examples of Azure Chaos Studio experiments
Consider this experiment, which is one of a couple of base templates provided by Azure:
This experiment causes a VM scale set to shut down. It has just one step, but with two branches. Recall that branches run in parallel. But notice that the first action in the second branch introduces a one-minute delay. The effect of this is to create this sequence of events:
- As soon as the experiment kicks off, disable autoscale on a VM scale set
- One minute later, shut down the scale set
- Let the fault run for ten minutes
- End the experiment and roll back to the prior state and settings
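The sequence above can be checked with a small timeline calculator. This is only a sketch of the scheduling rules described in this article (steps in sequence, branches in parallel, actions within a branch in order), not how Chaos Studio itself is implemented. The 11-minute duration on the autoscale action is my assumption, chosen so that it covers the delay plus the ten-minute fault.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    minutes: int

def timeline(steps):
    """Return sorted (start_minute, end_minute, name) tuples.

    steps is a list of steps, each step a list of branches,
    each branch a list of Action objects.
    """
    events = []
    step_start = 0
    for branches in steps:
        step_end = step_start
        for actions in branches:
            clock = step_start  # every branch starts when its step starts
            for act in actions:
                events.append((clock, clock + act.minutes, act.name))
                clock += act.minutes  # actions in a branch run back to back
            step_end = max(step_end, clock)
        step_start = step_end  # the next step waits for the slowest branch
    return sorted(events)

vmss_experiment = [
    [  # the single step
        [Action("disable autoscale", 11)],  # branch 1 (duration is an assumption)
        [Action("delay", 1), Action("shut down scale set", 10)],  # branch 2
    ]
]

events = timeline(vmss_experiment)
for start, end, name in events:
    print(f"minute {start:>2} to {end:>2}: {name}")
```

Because the delay occupies the first minute of the second branch, the shutdown starts at minute 1 even though both branches begin at minute 0.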
Another way that Azure Chaos Studio offers the experiment designer both control and creativity is through the use of fault properties and parameters. Let’s take a look at those in some example JSON from Azure’s fault library:
Most fault types, like this one, have simple properties, such as the duration you want the fault condition to run, or a set of resource-specific settings you can change.
Others have custom parameters, expressed in key-value pairs, like this one that disables a certificate in an Azure Key Vault:
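As a rough illustration of the two styles, the sketch below builds an action with a simple `duration` property plus a list of key-value parameters, then prints it as JSON. Treat the details as assumptions to verify against the fault library: the fault URN, the `certificateName` key, the `keyVaultSelector` ID, and the certificate name are my illustrative choices, not values taken from this article.

```python
import json

def to_parameters(settings):
    """Turn a plain dict into a list of {"key": ..., "value": ...} pairs."""
    return [{"key": key, "value": str(value)} for key, value in settings.items()]

disable_certificate = {
    "type": "continuous",
    # Assumed fault identifier; confirm the exact URN in the fault library.
    "name": "urn:csci:microsoft:keyVault:disableCertificate/1.0",
    "selectorId": "keyVaultSelector",  # hypothetical selector ID
    "duration": "PT10M",               # simple property: ISO 8601 duration, 10 minutes
    "parameters": to_parameters({"certificateName": "my-demo-cert"}),  # hypothetical value
}

print(json.dumps(disable_certificate, indent=2))
```

Keeping the settings in an ordinary dict and converting them at the end makes it harder to mistype the repetitive key-value structure by hand.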
Back in the Azure Chaos Studio wizard, further creativity is found in this Azure template, where the desired effect is an outage on Microsoft Entra ID (formerly known as Azure Active Directory):
Shutting down an identity service in production would be bad. A real shipwreck. So rather than injecting a fault directly on that service, you instead inject a rule change on a Network Security Group (NSG) to deny access to the targeted service.
This effectively tests the impact of a Microsoft Entra ID outage, without mucking about in a service that’s essential to dozens of other systems. Such an approach also limits the number of variables we are introducing to the target application or system at any one time. If that’s what you want.
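Here is a hedged sketch of what such a rule-change action might look like when built in Python. The fault URN, the parameter keys, the priority, and the `nsgSelector` ID are assumptions on my part, so check them against the fault library before relying on them. `AzureActiveDirectory` is the Azure service tag that covers the identity service's addresses.

```python
import json

def deny_rule_action(rule_name, service_tag, duration):
    """Build an illustrative NSG rule-change action that blocks outbound traffic."""
    settings = {
        "name": rule_name,
        "action": "Deny",
        "direction": "Outbound",
        "destinationAddresses": json.dumps([service_tag]),  # JSON array encoded as a string
        "priority": "100",  # assumed priority, low enough to win over allow rules
    }
    return {
        "type": "continuous",
        # Assumed fault identifier; confirm the exact URN in the fault library.
        "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.0",
        "selectorId": "nsgSelector",  # hypothetical selector ID
        "duration": duration,
        "parameters": [{"key": k, "value": v} for k, v in settings.items()],
    }

entra_outage = deny_rule_action("simulate-entra-outage", "AzureActiveDirectory", "PT10M")
print(json.dumps(entra_outage, indent=2))
```

The identity service itself is never touched. Only traffic from the resources behind that NSG is blocked, and only for the duration of the action.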
If you want a more chaotic chaos experiment, you can nest multiple actions into branches, organize those branches into steps, batten down the hatches, and brace for the storm!
Learn more about Azure Chaos Studio (Before pushing that experiment button)
If you’re interested in learning more about how to conduct Azure Chaos Studio experiments, check out this learning path on Pluralsight:
And if you want to learn about a super-power on the AWS Fault Injection Service (FIS), check out my upcoming article: Corralling Chaos with AWS Fault Injection Service (FIS) and CloudWatch