Uptime vs. Availability: How to measure and improve reliability
Trying to figure out which metric to use to improve your application's reliability? Here's the difference between the two, which to use, and how to measure them.
Apr 23, 2024 • 7 Minute Read
When it comes to measuring the reliability of an application, we usually reach for uptime: the duration a service was up and running, generally expressed as a percentage. However, to improve the user experience, one must focus on availability, which incorporates the user experience component.
In this article, I'll clarify the difference between availability and uptime and point out the correct ways to measure reliability.
Want to learn how improving your systems' observability can also increase their reliability? Read my article, "SRE: How making systems observable improves their reliability."
What is uptime?
Uptime refers to the time an application was up and running during a given period, expressed as a percentage. This is pretty straightforward: if your application didn't have any outages in the past 30 days, you have 100% uptime. If it had a one-day outage, the uptime is 29/30 = 96.67%.
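To make the arithmetic concrete, here is a minimal sketch of that uptime calculation; the function name and figures are illustrative, not from any particular tool:

```python
# A minimal sketch: uptime as a percentage of a measurement window.
def uptime_percent(window_hours: float, downtime_hours: float) -> float:
    """Percentage of the window during which the service was up."""
    return (window_hours - downtime_hours) / window_hours * 100

# 30-day window with a one-day outage
print(f"{uptime_percent(30 * 24, 24):.2f}%")  # 96.67%
```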
What is availability?
Availability refers to the percentage of time your application functions correctly to serve its users. Note that to measure availability accurately, we must include the user experience component.
Many organizations simply use uptime to refer to availability. The formula for calculating availability in this case is as follows:

Availability = Uptime / (Uptime + Downtime)
You can also measure availability using a formula that considers the Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR). That formula is as follows:

Availability = MTTF / (MTTF + MTTR)

MTTF refers to the mean (average) time elapsed before a failure occurs. MTTR refers to the mean time to repair the system and return it to full operation after an outage.
From this formula, you can readily derive a couple of observations:
First, the shorter the MTTR, the better the availability since MTTR is in the denominator. This is why having the necessary troubleshooting tools, including a dependable end-to-end observability system, is essential for higher availability.
Second, availability will be higher when we have fewer outages (less frequent failures). For example, if your service fails every day for 5 minutes, your availability can be calculated as follows:
MTTF = 24 hours
MTTR = 5 minutes
Availability = 24 / (24 + 5/60) = 99.65%
However, if your application fails once a month for 5 minutes:
MTTF = 30 days = 30 × 24 = 720 hours
MTTR = 5 minutes
Availability = 720 / (720 + 5/60) = 99.99%
You can now see the importance of less frequent outages and shorter repair time.
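If you prefer to see the formula as code, here is a minimal sketch that reproduces both scenarios above:

```python
# A minimal sketch: availability from MTTF and MTTR, per the formula above.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), returned as a percentage."""
    return mttf_hours / (mttf_hours + mttr_hours) * 100

# Failing every day for 5 minutes: MTTF = 24 h, MTTR = 5 min
print(f"{availability(24, 5 / 60):.2f}%")   # 99.65%

# Failing once a month for 5 minutes: MTTF = 720 h, MTTR = 5 min
print(f"{availability(720, 5 / 60):.2f}%")  # 99.99%
```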
The availability table
The following table shows the maximum downtime your application can accumulate over a year for a given availability target (assuming a 365-day year):

Availability target    Maximum downtime per year
99%                    3.65 days
99.9%                  8.76 hours
99.95%                 4.38 hours
99.99%                 52.6 minutes
99.999%                5.26 minutes
For example, for a 99.99% availability (my recommendation for a mission-critical web application), an application can have a maximum of 52.6 minutes of downtime per year.
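These budgets are easy to derive yourself. A minimal sketch, assuming a 365-day year:

```python
# A minimal sketch: converting an availability target into a yearly
# downtime budget (365-day year assumed).
def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum allowed downtime in minutes per year for a given target."""
    minutes_per_year = 365 * 24 * 60
    return minutes_per_year * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} minutes/year")
```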
Measuring MTTF and MTTR accurately can be challenging. In practice, a set of well-defined SLOs (Service Level Objectives) that incorporate the user experience can be used to measure reliability. Now, let's look at what you need to measure to elevate the reliability of your application.
What metrics should I measure?
Among the many metrics one could measure using monitoring tools, we should carefully pick the ones that directly influence the user experience. For example, while measuring the CPU utilization of your servers can provide valuable troubleshooting information, it will not reveal what users are experiencing. A better metric to measure is the latency (response time) that the user experiences.
In Site Reliability Engineering, the metrics we measure to derive a service level objective (SLO) are called Service Level Indicators (SLIs). I'll provide my recommendations for good SLIs for each application type below:
SLI (Service Level Indicator) recommendations
1. Web applications
These are applications with a web browser interface with which users can interact. There could be one or more backend services that the web application may depend on. An example of a web application would be an online e-commerce website where people can purchase goods. The SLIs for this type of application are:
Number of HTTP requests that complete successfully (HTTP status code 200 family), measured at the web server or the load balancer (a sketch of this calculation follows this list)
The latency (response time) of the web requests in milliseconds, measured at the web server or at the load balancer
Response times of specific functions in milliseconds: for example, adding an item to the cart, completing a purchase, or logging in to the application
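As an example of the first SLI above, here is a minimal sketch of a success-rate calculation; the request counts are hypothetical and would in practice come from your web server or load balancer logs:

```python
# A minimal sketch: a success-rate SLI from request counts.
def success_rate(total_requests: int, successful_requests: int) -> float:
    """Percentage of HTTP requests that completed successfully (2xx)."""
    if total_requests == 0:
        return 100.0  # no traffic means no failed requests
    return successful_requests / total_requests * 100

# Hypothetical figures for a 30-day window
print(f"{success_rate(1_000_000, 999_420):.3f}%")  # 99.942%
```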
2. APIs
These are generally microservices that provide specific functionality via APIs. For example, an authentication service can authenticate a user logging in via the web browser. The SLIs for this type of workload are similar to those for web applications.
Number of HTTP 500 errors (Internal Server Error), measured at the load balancer or API Gateway (tallying these from access logs is sketched after this list)
Number of HTTP requests that complete successfully (HTTP status code 200 family), measured at the load balancer or API Gateway
The latency (response time) of the API requests in milliseconds, measured at the load balancer or API Gateway
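Here is a minimal sketch of tallying the error-count SLI from access logs; the log format is a made-up example, so the parsing would need to match your gateway's actual output:

```python
# A minimal sketch: tallying 5xx errors from access-log lines.
# The log format below is an assumption, not any real gateway's format.
from collections import Counter

log_lines = [
    "2024-04-23T10:00:01Z GET /auth/login 200 35ms",
    "2024-04-23T10:00:02Z POST /auth/login 500 812ms",
    "2024-04-23T10:00:03Z GET /auth/session 200 12ms",
]

# Field 3 of each line is the HTTP status code in this made-up format.
status_counts = Counter(line.split()[3] for line in log_lines)
errors_5xx = sum(n for status, n in status_counts.items() if status.startswith("5"))
print(f"{errors_5xx} server errors out of {len(log_lines)} requests")
```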
3. Backend applications
These are backend applications that process data. A classic example of this would be a file transfer application that moves files between systems, often between two companies. Recommended SLIs here are:
The number of failed file transfers per day
The percentage of failed records per file. Consider categorizing the failure type if applicable: for example, failures may be due to malformed data, an incorrect destination, failed data validation checks, etc.
Average processing time and P95 (95th percentile) processing time of the data (see the sketch after this list)
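For the processing-time SLI, here is a minimal sketch using Python's standard statistics module; the duration samples are made up:

```python
# A minimal sketch: average and P95 processing time from duration samples.
import statistics

durations_ms = [120.0, 95.0, 210.0, 88.0, 430.0, 105.0, 99.0, 150.0, 310.0, 101.0]

average = statistics.mean(durations_ms)
# quantiles(n=100) returns the 99 percentile cut points P1..P99;
# index 94 is the 95th percentile.
p95 = statistics.quantiles(durations_ms, n=100, method="inclusive")[94]
print(f"avg = {average:.1f} ms, P95 = {p95:.1f} ms")
```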
4. Desktop applications
These are thick client applications that run on users' desktops. They usually connect to a backend system to access data. SLIs recommended here are:
Application startup time: Measured in milliseconds
Latency for specific functions: For example, a file upload function.
Number of client-side errors
Cache-hit ratio: Measured as a percentage (see the sketch after this list)
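The cache-hit ratio is simple to compute once the client counts its cache lookups; a minimal sketch with hypothetical counters:

```python
# A minimal sketch: cache-hit ratio as a percentage. The hit and miss
# counters are assumptions; a real client would increment them on every
# cache lookup.
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Percentage of lookups served from the local cache."""
    total = hits + misses
    return hits / total * 100 if total else 0.0

print(f"{cache_hit_ratio(hits=9_200, misses=800):.1f}%")  # 92.0%
```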
5. Big data processing
These backend applications process vast quantities of data (multiple terabytes per day), perhaps for machine learning use cases. They can also be data pipelines that parse, filter, and transport data. Here are the SLIs for these types of applications:
Data loss, measured as the percentage of records
The number of failed retries
Data parsing failures
Queue fill ratio, indicating that messages/records are starting to queue up (see the sketch after this list)
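A minimal sketch of the queue fill ratio; the depth and capacity figures are hypothetical and would come from your queueing system's own metrics:

```python
# A minimal sketch: queue fill ratio. Depth and capacity are made-up
# numbers; a real pipeline would read them from its broker's stats.
def queue_fill_ratio(depth: int, capacity: int) -> float:
    """Percentage of queue capacity currently in use."""
    return depth / capacity * 100

print(f"{queue_fill_ratio(depth=42_000, capacity=100_000):.1f}% full")  # 42.0% full
```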
6. Monitoring platforms
These are observability platforms that monitor your applications. Naturally, we need to measure the reliability of these systems too, to make sure they are available when needed. Here are the recommended SLIs:
Availability: Measured as the percentage of time a user can view metrics and perform queries successfully
Response time for searches and dashboard loads: You should use a synthetic probe to measure this so that you can compare it against a known baseline response time (see the sketch after this list)
Data freshness: How quickly metrics are ingested into the platform
Data accuracy: Again, a synthetic probe can help measure data ingestion and retrieval accuracy.
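For the response-time SLI, a synthetic probe can be as simple as a timed request; here is a minimal sketch with a placeholder URL:

```python
# A minimal sketch: a synthetic probe that times a monitoring-platform
# query. The URL is a placeholder, not a real endpoint.
import time
import urllib.request

def probe_response_time(url: str, timeout: float = 10.0) -> float:
    """Seconds taken to fetch the URL; raises on failure, which a real
    probe would record as an error."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.monotonic() - start

# elapsed = probe_response_time("https://monitoring.example.com/api/query?q=up")
# print(f"query responded in {elapsed:.2f} s")
```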
Conclusion
Reliability is a crucial attribute of any successful application. To maintain high levels of reliability, one must consistently measure the correct metrics. While a simple uptime metric can provide some sense of reliability, accurately measuring reliability requires tracking the metrics that directly impact the user experience. For this purpose, you must define Service Level Objectives and Service Level Indicators that match the type of your application and the specific functionality it serves.
Want to learn more about SRE best practices?
Pluralsight offers a Fundamentals of Site Reliability Engineering (SRE) learning path that teaches you all about SRE and how to implement it. It covers a wide range of topics, starting with the foundations of SRE and how to incorporate it into your system design, and moving on to more advanced topics like managing SRE teams and implementing effective incident response and change management.
If you liked this article, I’d highly recommend checking out my course, “Implementing Site Reliability Engineering (SRE) Reliability Best Practices.” Best of luck on your journey to implement observable, reliable systems!