The Value of Zero Downtime Deployments
We practice zero-downtime deployments so that we can deploy during the day.
Downtime sucks, regardless of whether it’s planned or not. As a user, when you visit a site to complete a task but can’t because it’s down, you’re frustrated. Whether or not the site put up a cutesy banner a week earlier warning you about the downtime isn’t really relevant. Banner or not, you can’t complete your task. With that in mind, meet Olivia, a fictional customer who uses our products in Australia. Let’s spend some time dissecting how planned downtime impacts someone who’s in a different timezone from our US-based teams.
Deploying changes with downtime during “off-hours” is standard practice across the industry. It reduces the impact on your users by causing downtime during times of lower usage. However, when your product is used globally, there are no “off-hours”. In Olivia’s case, US “off-hours” are in the middle of her workday.
With Olivia in mind, let’s walk through some key considerations of zero downtime deployments and why they’re important. Customers all over the world expect minimal, or preferably no, downtime for the products they rely on and pay for. We also know that teams with frequent and stable deployments have healthier products. Deploying during the day improves engineer job satisfaction since it leads to a healthier work/life balance. And finally, daytime deployments reduce risk, since supporting teams who might be needed to address a failed rollout are already working and don’t have to be called in after hours.
Deployments that cause downtime are frustrating to our customers. To provide a better customer experience, such deployments might get moved to “off-hours”, which may impact fewer customers but does so at the cost of engineers’ work/life balance.
Let’s define what zero downtime deployments mean. Ideally, there is no service disruption for customers: they don’t even realize that you’ve deployed a new version of the code. The next best option is a rolling disruption for some services, such as requiring a user to reauthenticate or a feature being unavailable to a subset of users.
“During the day” is an interesting concept, particularly with a globally distributed workforce. To Olivia, “day” means the literal opposite of what it means to a US-based employee. With that in mind, it’s helpful to think of “day” as the working hours of the team that owns the deployment.
You might be thinking, this all sounds great but we’re just not there yet. No problem, incremental progress towards zero downtime deployments is great too! Brainstorm different strategies and architectures that can help you get there. For example, is there a part of your app that goes down on every deployment? If so, is there a way of isolating it so that it only goes down when you change its functionality, rather than when you change anything in the app?
One of the best places to start is to consistently measure how much downtime each deployment causes. Is there a particular approach that takes less time? Are there steps that could be eliminated or done ahead of time? Get creative and map out your deployment from start to finish.
You can also analyze usage patterns to understand when your app is used the least. Be careful with this approach, as you might further anger the Olivias of the world. Sometimes, you can use time differences to your benefit. For example, late on a Friday in the US is already Saturday in Australia and nearby parts of the world.
Any reduction in downtime is valuable progress for both your customers and the engineers working on the product. While “off-hours” deployments are a potential solution, they can increase engineer burnout and alienate your users in different timezones. Even if you’re not at zero downtime, daytime deployments yet, your users and your team will appreciate any progress you can make!
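If you want a concrete starting point for measuring downtime per deployment, here is a minimal Python sketch. It assumes you already record periodic health-check results while a deployment runs. The `HealthSample` type and the `measure_downtime` function are hypothetical names made up for this illustration, not part of any tool we use, and the sketch counts the time from each failed probe until the next probe as downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HealthSample:
    """One health-check result (hypothetical type for illustration)."""
    timestamp: datetime
    healthy: bool


def measure_downtime(samples):
    """Estimate downtime as the time from each failed probe to the next probe."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    total = timedelta(0)
    # Pair each sample with the one that follows it.
    for current, following in zip(ordered, ordered[1:]):
        if not current.healthy:
            total += following.timestamp - current.timestamp
    return total


# Made-up probes taken every 10 seconds during a deployment.
start = datetime(2024, 1, 5, 14, 0, 0)
samples = [
    HealthSample(start, True),
    HealthSample(start + timedelta(seconds=10), False),
    HealthSample(start + timedelta(seconds=20), False),
    HealthSample(start + timedelta(seconds=30), True),
]

downtime = measure_downtime(samples)
print(f"Estimated downtime: {downtime.total_seconds():.0f} seconds")
```

The estimate is only as precise as your probe interval, so shorter intervals give a tighter number. Recording this value for every deployment gives you a baseline to compare different approaches against.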
Check out our Engineering at Pluralsight document to see all of the statements that shape our engineering culture.