11 key DevOps metrics to measure team success
DevOps metrics are a way to track the stability and velocity of your software delivery and identify areas of improvement.
Mar 14, 2023 • 3 Minute Read
DevOps metrics are regarded as the industry standard for evaluating the reliability and quality of software delivery within your organization. In tracking these metrics, you’re able to identify bottlenecks that are slowing down your delivery and causing failures in deployed code.
From an operational perspective, DevOps metrics provide data-driven insights that help you continuously improve and deliver better software and more value to your customers. Isn’t that just music to your executive team’s ears?
Below, we cover the four key DevOps metrics (commonly known as the DORA metrics) plus seven other metrics to measure and improve your team’s performance.
What are the 4 key metrics in DevOps? DORA metrics to know
Google’s DevOps Research and Assessment (DORA) team spent six years conducting surveys to study engineering teams and their DevOps processes. The group began publishing its findings in 2014 with the first State of DevOps report, and has continued to release yearly updates.
In the first report, the DORA team outlined four key metrics to track software development team performance.
Those metrics are:
Deployment frequency
Lead time for changes
Time to restore service
Change failure rate
These DORA metrics are now widely adopted as a set of industry standards for teams to benchmark their performance against.
The metrics can be used to track team performance by measuring whether your teams are “low”, “medium”, “high”, or “elite” performers. By continuing to track these metrics, you’ll have a clearer picture of how your team has developed over time and what areas may still need improvement.
1. Deployment frequency
Deployment frequency is a measure of how often code changes are released to production. In general, smaller or more frequent deployments pose less risk and put you in a state of continuous delivery.
Elite teams are able to perform on-demand deployments because software is in a constantly releasable state—and ideally, deployed daily. Low-performing teams tend to produce large deployments over the span of months, which can impact velocity and increase the risk and impact of deployment failure.
How to increase your deployment frequency: Shrink your deployment size. Rather than bundling a large number of features and changes, each release should contain a single feature or change, and each update should be as small as possible.
Smaller deployments make it easier to deploy more frequently. And if errors do occur during a deployment, they’ll have a smaller impact, and you’ll be able to more quickly identify the issues within a small deployment.
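As a rough illustration, deployment frequency can be computed from a list of production deployment timestamps. The sketch below is not tied to any particular tool: the function name, the trailing-window approach, and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def deployments_per_week(deploy_times, now, window_days=28):
    """Average number of production deployments per week in a trailing window."""
    window_start = now - timedelta(days=window_days)
    # Count only deployments that fall inside the trailing window.
    recent = [t for t in deploy_times if window_start < t <= now]
    return len(recent) / (window_days / 7)

# Hypothetical production deployment timestamps (illustrative data only).
deploys = [datetime(2023, 3, day, 12, 0) for day in (1, 2, 6, 8, 9, 13, 14)]
weekly_rate = deployments_per_week(
    deploys, now=datetime(2023, 3, 14, 23, 59), window_days=14
)
print(weekly_rate)  # 7 deployments over 2 weeks -> 3.5
```

A trailing window keeps the number responsive to recent changes in how your team ships, rather than averaging over the whole history of the project.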
2. Lead time for changes
Lead time for changes is the time it takes for a developer’s committed code to reach production. This metric serves as an early indicator of process issues and helps you pinpoint bottlenecks that are slowing down your software delivery.
An elite team takes less than an hour from when code is committed to when it’s deployed. A low-performing team can take more than six months to make and deploy changes.
How to reduce your lead time for changes: Utilize software to help you identify if commits are stuck in waiting states, like waiting for QA testing. Once you discover how your testing process is delaying deployments, you can automate aspects of testing during production or hire additional QA testers to address these bottlenecks.
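To make the definition concrete, here is a minimal sketch that derives lead time from pairs of commit and deployment timestamps. The data shape and names are hypothetical, and reporting the median rather than the mean is this example's choice, not something DORA prescribes.

```python
from datetime import datetime
from statistics import median

def lead_times_hours(changes):
    """changes: iterable of (committed_at, deployed_at) datetime pairs."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]

# Hypothetical changes: when each was committed and when it reached production.
changes = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 45)),   # 0.75 hours
    (datetime(2023, 3, 1, 10, 0), datetime(2023, 3, 2, 10, 0)),  # 24 hours
    (datetime(2023, 3, 2, 8, 0), datetime(2023, 3, 2, 12, 0)),   # 4 hours
]
median_lead_time = median(lead_times_hours(changes))
print(median_lead_time)  # 4.0
```

The median is less sensitive than the mean to the occasional change that sits in a waiting state for days, so it tends to describe the typical change better.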
3. Time to restore service
Time to restore service, or mean time to recovery (MTTR), is a measure of how long it takes your team to recover from a failure in production. This metric is huge from an operational perspective, as the quicker you can respond, the better the customer experience will be.
To measure the time to restore service, you need to know the timestamp of when the incident occurred and when it was resolved. You also need to know what deployment resolved the incident.
An elite team typically takes less than an hour to get services up and running again. A low-performing team tends to take more than six months to restore services.
How to reduce your time to restore service: Decrease the size of your deployments. Time to restore service works hand in hand with deployment frequency: by reducing the size of your deployments, you can reduce the impact radius if something does go wrong.
A smaller deployment that fails is going to be easier to troubleshoot and restore compared to a larger deployment. If your time to restore service is long, this may also indicate you need better rollback systems to help you reverse a flawed deployment and quickly get back on your feet.
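The measurement described above (incident start, incident resolution, and the deployment that fixed it) can be sketched as follows. The `Incident` record and its field names are hypothetical and only stand in for whatever your incident tracker exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started_at: datetime
    resolved_at: datetime
    resolving_deploy: str  # hypothetical ID of the deployment that fixed it

def mean_time_to_restore(incidents):
    """Average duration between incident start and resolution."""
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)

# Illustrative incidents only.
incidents = [
    Incident(datetime(2023, 3, 1, 10, 0), datetime(2023, 3, 1, 10, 30), "deploy-101"),
    Incident(datetime(2023, 3, 5, 14, 0), datetime(2023, 3, 5, 15, 30), "deploy-107"),
]
mttr = mean_time_to_restore(incidents)
print(mttr)  # 1:00:00
```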
4. Change failure rate
Change failure rate is the percentage of deployments that result in a failure. This metric measures the stability of the code your team is shipping, as well as its quality. It’s calculated as a percentage of deployments that result in a severe service failure and require immediate remediation, such as a rollback or patch.
Elite teams keep their change failure rate between 0% and 15%, while low-performing teams have a change failure rate between 16% and 30%.
Oftentimes teams struggle initially with defining what a failure is. This definition may vary from company to company and even team to team. Failure in DevOps can be about code and it can also be about outcomes. Did a feature work the way it was expected? Or did a feature completely miss the mark of user intent?
Change failure rate should ideally measure when a deployment results in degraded performance. However, this is something that should be defined as a team.
How to reduce your change failure rate: Discover the root cause of failures. Deployments often fail because of deployment error, poor testing, or poor code quality. Human error is a leading cause of deployment errors, so implementing deployment automation can remove much of the human element. It also helps to check that your code review process and requirements are resulting in thorough, meaningful, and helpful code reviews.
If poor testing is the cause, consider automating testing. To improve code quality, revisit your code review process to ensure junior developers are learning from senior team members.
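Once your team has agreed on what counts as a failure, the calculation itself is simple. The sketch below assumes each deployment record carries a `failed` flag set according to that team definition; the record shape is hypothetical.

```python
def change_failure_rate(deployments):
    """Percentage of deployments flagged as failed (0.0 if there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["failed"])
    return 100 * failed / len(deployments)

# Illustrative data: 20 deployments, 3 of which needed a rollback or patch.
deployments = [{"id": n, "failed": n in (4, 11, 17)} for n in range(20)]
rate = change_failure_rate(deployments)
print(rate)  # 15.0
```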
The four key metrics: How does your team stack up?
Metric | Elite | High | Medium | Low
Deployment frequency | On demand (multiple deploys per day) | Between once per week and once per month | Between once per month and once per six months | Fewer than once per six months
Lead time for changes | Less than one hour | Between one day and one week | Between one month and six months | More than six months
Time to restore service | Less than one hour | Less than one day | Between one day and one week | More than six months
Change failure rate | 0%-15% | 16%-30% | 16%-30% | 16%-30%
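If you want to place a measured value into one of these tiers automatically, a small lookup is enough. The sketch below does this for lead time for changes only. Note that the published bands leave gaps (for example, between one hour and one day), so folding each gap into the next tier is an assumption of this example, as is treating six months as roughly 182 days.

```python
from datetime import timedelta

# Upper bounds based on the lead time for changes row of the table above.
# Assumption: values falling in a gap between bands go to the next tier down.
LEAD_TIME_TIERS = [
    ("elite", timedelta(hours=1)),
    ("high", timedelta(weeks=1)),
    ("medium", timedelta(days=182)),  # roughly six months
]

def lead_time_tier(lead_time):
    """Return the performance tier for a lead time given as a timedelta."""
    for name, upper_bound in LEAD_TIME_TIERS:
        if lead_time < upper_bound:
            return name
    return "low"

print(lead_time_tier(timedelta(minutes=30)))  # elite
print(lead_time_tier(timedelta(days=3)))      # high
```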
Why are DevOps metrics important for engineering teams and managers?
DORA metrics allow you to gain insight into the two key predictors of successful engineering teams: throughput and stability. Throughput is the speed at which software is delivered to the end user, and stability is the reliability of that software to perform as expected without failure.
Throughput is measured by deployment frequency and lead time for changes.
Stability is measured by time to restore service and change failure rate.
These metrics are designed to work in tandem. Focusing on just one could impact the performance of another. As a whole, they provide a clear picture of team health and how well internal DevOps processes are working. In turn, leaders can use these metrics to advocate for their team and improve customer experience.
The benefits of DevOps include helping your teams focus on the customer experience and simplifying the goals of each release to help you move faster than your competitors.
Additional DevOps metrics to track
While DORA metrics are a powerful tool for engineering teams, we’ve highlighted a few more metrics that can help you build upon your DevOps foundations.
5. Cycle time
Cycle time is a measure of how long it takes your team to deliver once they start working on a task. Cycle time on its own is an informative metric showing speed of delivery, but it can be even more valuable if you dig deeper into what’s impacting your cycle time.
How to reduce cycle time: If a feature took longer than expected to deliver, you can look into queue time, which will tell you how much time work spent in a waiting state. If you spot long-running pull requests as a recurring bottleneck in your process, consider analyzing your code review process and communicating the importance of timely peer reviews.
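One way to separate cycle time from queue time is to look at a task's status history. The sketch below assumes a chronologically ordered list of status changes and a set of statuses your team treats as "waiting"; the status names are made up for the example.

```python
from datetime import datetime, timedelta

def cycle_time(started_at, delivered_at):
    """Total time from starting work on a task to delivering it."""
    return delivered_at - started_at

def queue_time(status_events, waiting_statuses, delivered_at):
    """Time spent in waiting states.

    status_events: chronologically ordered (timestamp, status) pairs.
    """
    waiting = timedelta()
    # Each status lasts until the next event; the last one ends at delivery.
    ends = [ts for ts, _ in status_events[1:]] + [delivered_at]
    for (ts, status), end in zip(status_events, ends):
        if status in waiting_statuses:
            waiting += end - ts
    return waiting

# Hypothetical status history for a single task.
events = [
    (datetime(2023, 3, 1, 9), "in_progress"),
    (datetime(2023, 3, 1, 17), "awaiting_review"),
    (datetime(2023, 3, 3, 9), "in_review"),
    (datetime(2023, 3, 3, 11), "awaiting_qa"),
]
delivered = datetime(2023, 3, 4, 11)

total = cycle_time(events[0][0], delivered)
waited = queue_time(events, {"awaiting_review", "awaiting_qa"}, delivered)
print(total, waited)  # 74 hours in total, 64 of them spent waiting
```

In this made-up example most of the cycle time is queue time, which points at review and QA handoffs rather than development speed.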
6. Deployment time
Deployment time is a measure of how long it takes to deploy a release into a testing, development, or production environment. This metric allows teams to see where they can improve deployment and delivery methods.
How to reduce deployment time: If a deployment takes upwards of an hour, that’s probably an indication that something is wrong. A solution could be to optimize your CI/CD deployment pipeline so it’s more streamlined and properly resourced to push code to production more efficiently.
7. Mean time to failure
Mean time to failure (MTTF), also known as uptime, is the average amount of time a system is able to run before it breaks. This metric is used to monitor the status of non-repairable systems and measure how long a component will perform before it fails.
MTTF allows teams to understand how long a critical component will continue to work before it needs to be replaced, helping them prepare for costly system failures.
How to improve MTTF: Invest in monitoring tools. Monitoring applications, logs, and tracing are all key to helping you detect and quickly repair failures. You can also look into adopting automated workflows to help discover and document issues so your team can focus on repairing the system.
8. Mean time between failures
Mean time between failures (MTBF) is a measure of the average time between repairable failures of a system or product. This is a key reliability and availability metric that can show how well your team can prevent and reduce potential failures. The higher the MTBF is, the more reliable the system is.
If your MTBF is high, this shows that your product has few incidents or that it’s being repaired quickly. If your MTBF is low, you can evaluate how often your product is undergoing maintenance and whether you need to be tracking other failure metrics more closely.
How to increase your MTBF: Focus on improving your time to restore service or MTTR. These metrics work together, so the more you can reduce your time to restore service, the higher your MTBF will be. This may look like a renewed focus on preventative maintenance and training team members on incident management action plans for when systems go awry.
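MTBF is commonly calculated as total operating time divided by the number of failures. The sketch below uses that definition with made-up numbers; the function name and inputs are assumptions for illustration.

```python
def mtbf_hours(period_hours, downtime_hours):
    """Operating time divided by number of failures.

    period_hours: length of the observation period.
    downtime_hours: one entry per failure, giving how long the repair took.
    """
    if not downtime_hours:
        return float("inf")  # no failures observed in the period
    uptime = period_hours - sum(downtime_hours)
    return uptime / len(downtime_hours)

# Illustrative 30-day period (720 hours) with three failures.
downtimes = [2, 1, 3]
mtbf = mtbf_hours(720, downtimes)
mttr = sum(downtimes) / len(downtimes)
print(mtbf, mttr)  # 238.0 2.0
```

Under this definition, shorter repairs leave more operating time in the same period, which is how a lower time to restore service nudges MTBF upward.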
9. Defect escape rate
Defect escape rate is a measure of the number of bugs that escape detection and are released into production. This metric helps you determine the effectiveness of your testing methods.
How to reduce your defect escape rate: A high defect escape rate indicates bad code review and bad testing processes. A low defect escape rate indicates great code review and great testing processes. As a general rule of thumb, teams should aim to catch about 90% of defects in the QA phase prior to release.
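One common way to express this metric is as the share of all known defects that were found in production rather than in QA. The sketch below assumes you can count both; the function name and numbers are illustrative only.

```python
def defect_escape_rate(found_in_qa, found_in_production):
    """Percentage of all known defects that escaped into production."""
    total = found_in_qa + found_in_production
    if total == 0:
        return 0.0
    return 100 * found_in_production / total

# Illustrative release: 45 defects caught in QA, 5 found after release.
escape_rate = defect_escape_rate(found_in_qa=45, found_in_production=5)
print(escape_rate)  # 10.0, i.e. 90% of defects were caught in QA
```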
10. Application usage and traffic
Application usage and traffic measures the number of users accessing your system after deployment. Tracking usage and traffic over time helps give your team a picture of what “normal” traffic looks like so that when you spot an abnormality, you can dig deeper to find the root cause.
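A simple way to turn "normal" into something you can check is to compare the current value against a baseline built from recent history. The sketch below flags a value that sits more than three standard deviations from the historical mean. The threshold and the sample numbers are assumptions, and real traffic with daily or weekly cycles usually needs a more careful baseline.

```python
from statistics import mean, stdev

def is_abnormal(history, current, threshold=3.0):
    """True if current deviates from the historical mean by more than
    `threshold` sample standard deviations."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > threshold

# Hypothetical daily active users for the past week.
history = [1000, 1040, 980, 1010, 990, 1020, 960]
print(is_abnormal(history, 1030))  # False: within the usual range
print(is_abnormal(history, 1500))  # True: worth investigating
```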
11. Error rates
Error rates measure the number of errors that occur within a given timeframe. Errors are grouped into two categories:
Bugs: These are errors that are found in the code after deployment, assuming the software passed all tests in the QA phase.
Production issues: These are issues that occur with other components external to the software, such as an API gateway.
How to reduce your error rate: If you’re experiencing a high error rate, this may be an indication that your team is recklessly deploying or that your testing methods are inefficient.
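To track errors "within a given timeframe", you can bucket error timestamps by a fixed interval. The sketch below counts errors per hour; the hourly bucket and the sample timestamps are assumptions, and in practice these would come from your logs or monitoring tool.

```python
from collections import Counter
from datetime import datetime

def errors_per_hour(error_times):
    """Count errors in each clock hour."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in error_times
    )

# Hypothetical error timestamps.
error_times = [
    datetime(2023, 3, 14, 9, 5),
    datetime(2023, 3, 14, 9, 20),
    datetime(2023, 3, 14, 9, 58),
    datetime(2023, 3, 14, 10, 15),
]
counts = errors_per_hour(error_times)
print(counts[datetime(2023, 3, 14, 9)])  # 3
```

Comparing these counts before and after a deployment is one way to see whether a release introduced new errors.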
How to prioritize the right metrics for your team
The goal of tracking any of the metrics listed above is to maximize efficiency and improve delivery. While there’s no clear guideline for which particular metrics you should track, the four DORA metrics are a great place to start as you begin scaling DevOps.
These metrics provide insight into the stability and throughput of your DevOps practices and can serve as a compass of sorts, pointing you in the direction of what can be improved.
It’s also important to define a clear goal for why you’re tracking certain metrics. Rather than tracking every metric, consider the ones most likely to increase your business value. There’s a fine line between tracking metrics to make actionable improvements and tracking metrics for vanity’s sake.
How Pluralsight Flow can level up your DORA measurement
Value stream mapping is another way to visualize the data these metrics track.
Collecting data will only get you so far. While DORA metrics provide a solid foundation for tracking stability and throughput, the responsibility is on you to improve processes based on the data you gather.
Pluralsight Flow allows you to track DORA metrics alongside Flow’s own actionable metrics to help you reduce developer friction and accelerate delivery. Flow Retrospective reports include all four key DORA metrics to help you uncover patterns about your deployments and incidents, and facilitate data-driven decision-making.
To learn more about how Pluralsight Flow can help you optimize the way you gather and use DevOps metrics, schedule a demo with our team today.