How to conduct blameless postmortems after an incident
Postmortems help pin down what went wrong and how to stop it from happening again. To foster growth instead of fear, though, blame needs to be kept out of the process.
Aug 13, 2024 • 14 Minute Read
When an incident occurs, everyone wants to know where the failure happened. It’s just human nature—during the 2024 CrowdStrike incident, we were all glued to the news, searching for an explanation of why our Windows machines had been struck with a BSOD. However, while postmortems are a great tool, they can quickly become a blame game, with the focus on who made the mistake instead of how it was made.
This sort of postmortem is poison for your team culture. When employees know they’ll be pilloried for making a mistake, they become risk-averse. Honesty, learning, and accountability go out the window as people seek to dodge blame, because owning up to a mistake invites punishment. That’s why a good postmortem must always be blameless.
In this article, I’ll cover how to conduct a blameless postmortem, the essential components it should include, and templates you can draw on.
What is a blameless postmortem?
A blameless postmortem is a structured process where teams analyze a past incident to document the root cause, business impact, timeline, lessons learned, and action items. The primary goal is to learn from the incident and prevent future occurrences. Blameless postmortems emphasize process improvements over individual fault, encouraging a healthy, open culture where engineers feel safe admitting mistakes.
Blameless postmortems are vital to incident response
Postmortems are useful in any field that includes incident management—the process of quickly resolving issues to limit business impact—and they almost always benefit from being blameless. The practice is most common in Site Reliability Engineering (SRE), but fields such as cybersecurity and cloud computing benefit from blameless postmortems too, as do non-tech fields such as emergency response, manufacturing, and retail.
If you’re providing a service that must operate within agreed-upon Service Level Objectives (SLOs), you need that service to be reliable. Blameless postmortems are an indispensable tool for achieving that.
Other reasons to conduct blameless postmortems
Let’s not sugar-coat it—conducting postmortems of any kind is rarely enjoyable. However, several benefits make it worthwhile:
- Comprehensive Incident Understanding: Postmortems offer a detailed account of the incident, including duration, user impact, financial impact, root cause, and preventive actions.
- Root Cause Analysis: Using methods like the Five Whys, postmortems can uncover underlying issues within systems.
- Learning Opportunities: Postmortems help teams reflect on what could have been done differently, promoting continuous improvement.
- Preventive Actions: They identify gaps in monitoring and other preventive measures.
- Accountability: Clear action items with owners and deadlines ensure follow-through on improvements.
Components of a successful postmortem
We’ve covered why you’d want to conduct a postmortem. But what about the how? Postmortems must contain the following high-level sections at a minimum:
1. Executive Summary
The executive summary is a brief paragraph or two summarizing the issue. Avoid going into too much technical detail; just sum up the particulars so readers can grasp what happened at a glance. Below is an example of what you might include if you were a Site Reliability Engineer:
Due to a failed release on October 15th, 8:00 PM-9:00 PM CDT, 20% of auth microservice instances failed to connect to the backend, causing auth service calls to fail. Failure of the auth service impacted the trading portal, resulting in 200 users being unable to log in to the trading system. The issue was resolved by updating the failed auth service instances at 11:15 PM CDT on October 15th. We have enhanced our monitoring to catch this issue during releases.
2. Business Impact
The whole reason you’re doing a postmortem is that an incident disrupted your ideal state of operations. Quantify the impact with actual figures, and if there is a financial impact, make sure to mention it. Continuing with our SRE example, you might write something like this:
10% of our 2000 trading users (200) could not log in between 8:00 PM and 11:15 PM on October 15th. These users could still place orders using our manual trading processing system. There is no financial impact, and we remained within our SLO (Service Level Objective) for the 4-week cycle.
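If you want to sanity-check a claim like "we remained within our SLO," a quick error-budget calculation helps. Here's a minimal sketch in Python; the 99.5% availability target is an assumed, illustrative SLO (not part of the example above), and the outage is weighted by the fraction of users affected.

```python
# Illustrative error-budget check -- the SLO value is a hypothetical assumption.
SLO = 0.995                      # assumed availability target for the 4-week cycle
cycle_minutes = 28 * 24 * 60     # 4-week (28-day) SLO window

error_budget = (1 - SLO) * cycle_minutes   # allowed "bad" minutes in the window
outage_minutes = 3 * 60 + 15               # 8:00 PM to 11:15 PM
affected_fraction = 200 / 2000             # 10% of trading users

# Weight the outage by the fraction of users affected.
budget_consumed = outage_minutes * affected_fraction

print(f"Error budget: {error_budget:.0f} min")     # ~202 min
print(f"Consumed:     {budget_consumed:.1f} min")  # 19.5 min
print("Within SLO" if budget_consumed <= error_budget else "SLO breached")
```

Even a rough calculation like this makes the "no financial impact, within SLO" statement verifiable rather than a judgment call.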
3. Root Cause
The smoking gun everyone wants to know about! What actually led to the incident? One classic way to get to the root cause is the “Five Whys” method: ask “why?” five times, each time aiming the next “why” at the answer to the previous one. The fifth “why” should reveal the root cause of the problem.
Let’s show this in practice:
Around 10% of users saw "HTTP 500 internal server error" when logging on to the trading portal.
Why?
Analyzing the logs showed the portal was erroring when loading the user profile.
Why?
The portal received an "unknown error" when calling the "auth" microservice.
Why?
Upon analyzing the auth microservice logs, we found it was erroring out when connecting to the backend PostgreSQL database. This error only happened on a subset of auth microservice instances.
Why?
The auth microservice had an incorrect database connectivity configuration when connecting to the backend PostgreSQL database.
Why?
Last night, the auth service's code was released to update the database connectivity configuration, but the code release failed to update some microservice instances, thereby leaving those instances with outdated configurations.
So, ultimately, the root cause was a partially successful code release on a remote service (the auth service) on which your application depends.
Note: The Five Whys method is a powerful mechanism for determining the root cause. However, there may be situations where it doesn’t apply neatly. For example, if a hard drive fails, you may only need two or three “whys” rather than five.
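In this example, the root cause was a release that left some instances on stale configuration, so an obvious preventive step is a post-release check that every instance reports the same configuration version. Here's a rough sketch of that idea in Python; the instance URLs and a /health endpoint returning a config_version field are assumptions for illustration, not features of any real auth service.

```python
import json
from urllib.request import urlopen

# Hypothetical instance addresses -- in practice you'd pull these
# from your service registry or orchestrator.
AUTH_INSTANCES = [
    "http://auth-1.internal:8080",
    "http://auth-2.internal:8080",
    "http://auth-3.internal:8080",
]

def config_versions(instances):
    """Collect the config version each instance reports via an assumed /health endpoint."""
    versions = {}
    for base_url in instances:
        with urlopen(f"{base_url}/health", timeout=5) as resp:
            payload = json.load(resp)
        versions[base_url] = payload.get("config_version", "unknown")
    return versions

def check_config_drift(instances):
    """Fail the release if instances disagree on their configuration version."""
    versions = config_versions(instances)
    if len(set(versions.values())) > 1:
        raise RuntimeError(f"Config drift detected after release: {versions}")
    print("All instances report the same config version.")

if __name__ == "__main__":
    check_config_drift(AUTH_INSTANCES)
```

Running a check like this as the final step of a deployment pipeline would have caught the partial rollout before any user noticed.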
4. Timeline
Otherwise known as “a series of unfortunate events.” This section provides a chronological account of the incident, which helps the team uncover future process improvements.
For example, if you’re a Site Reliability Engineer, did you have to wait 25 minutes for the database team to join the war room? If so, you might ask if you can do specific tasks in parallel in the future.
5. Lessons Learned
If you’re not learning any lessons, the incident is bound to occur again. Possible subsections include:
- What went well?
- What did not go well?
- What can we do to avoid this in the future?
- Did our monitoring catch the issue before end users reported it? How can we improve? (One simple improvement is a synthetic check; see the sketch after this list.)
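For that last question, one common improvement is a black-box probe that exercises the user journey continuously, so monitoring fails before customers notice. A minimal sketch, assuming a hypothetical login endpoint and payload:

```python
from urllib import request

# Hypothetical black-box probe that exercises the login path the way a user would.
# The URL and payload are illustrative, not a real API.
LOGIN_URL = "https://trading.example.com/api/login"

def login_probe() -> bool:
    """Return True if a synthetic login succeeds, False otherwise."""
    data = b'{"username": "synthetic-probe", "password": "dummy"}'
    req = request.Request(LOGIN_URL, data=data,
                          headers={"Content-Type": "application/json"})
    try:
        with request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False

if __name__ == "__main__":
    # Run on a schedule (e.g., every minute) and alert on failures so
    # monitoring catches login errors before end users report them.
    print("login OK" if login_probe() else "login FAILED")
```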
6. Action Items
It’s all well and good to know the cause and be wiser for what happened, but you also need an action plan: concrete steps to stop the incident from recurring. Every action item must have an owner and a target date, and the task owner should be responsible for updating the postmortem when the task is complete.
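To make that follow-through easier to track, it can help to keep action items in a structured form you can review automatically. Below is a small, hypothetical sketch that flags any open item past its target date; the items and field names are made up for illustration.

```python
from datetime import date

# Hypothetical action items from the example incident above.
action_items = [
    {"task": "Add post-release config drift check", "owner": "SRE team",
     "due": date(2024, 10, 25), "done": True},
    {"task": "Alert on auth service login failures", "owner": "Platform team",
     "due": date(2024, 10, 30), "done": False},
]

def overdue(items, today=None):
    """Return open action items whose target date has passed."""
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

for item in overdue(action_items):
    print(f"OVERDUE: {item['task']} (owner: {item['owner']}, due {item['due']})")
```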
Developing a blameless postmortem culture
So, we’ve talked about creating a great postmortem, but how do we make it “blameless?” This largely comes down to culture rather than process. You need to create an environment where your staff consistently write postmortems, while overcoming challenges such as time constraints and the natural reluctance of people to own up to mistakes. Here are some strategies you can use:
1. Senior leadership support and participation
Postmortem culture must be championed from the top down; when leaders review their own mistakes, they lead by example. Additionally, make sure postmortems are reviewed with senior leadership. When leaders are involved, teams are usually more inclined to write postmortems.
2. Reward well-written postmortems
Conduct monthly or quarterly reviews of well-written postmortems and reward the teams who wrote them. This will encourage other teams to follow suit.
3. Develop a postmortem repository
All postmortems must be stored in a repository for easy access—GitHub can be a great option for this. As teams accumulate well-written postmortems, they can draw on past ones to troubleshoot new issues. Further, having a central repository of postmortems lends itself to even more intelligent solutions (such as using machine learning to aid with new problems).
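Even a plain folder of Markdown files goes a long way: once past postmortems live in one place, engineers can search them for similar symptoms during a new incident. A minimal sketch, assuming a hypothetical postmortems/ directory of Markdown files:

```python
from pathlib import Path

def search_postmortems(keyword, repo_dir="postmortems"):
    """Return postmortem files mentioning the keyword (case-insensitive)."""
    matches = []
    for path in Path(repo_dir).glob("**/*.md"):
        if keyword.lower() in path.read_text(encoding="utf-8").lower():
            matches.append(path)
    return matches

# Example: look for past incidents involving stale configuration.
for match in search_postmortems("config drift"):
    print(match)
```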
When should you conduct a postmortem?
Not all incidents should qualify for a postmortem. Create incident severity levels, such as Priority One (P1), Priority Two (P2), and Priority Three (P3), with P1 being the most critical for the business. In this scenario, commit to doing postmortems for every P1, and potentially for some P2s and P3s depending on business impact and team sentiment.
Here are my guidelines for determining whether you must do a postmortem (the sketch after this list shows one way to encode them):
- Organizational mandate (e.g., you’ve committed to doing one for all P1 incidents)
- The incident resulted in data loss (or another form of loss specific to your industry, such as monetary loss)
- There was significant user or customer impact
- The team feels they should dig deep to understand what happened
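Here's one way you might encode those guidelines in code; the function and its inputs are purely illustrative, not a standard.

```python
def postmortem_required(severity: str, data_loss: bool,
                        significant_user_impact: bool,
                        team_wants_review: bool) -> bool:
    """Encode the guidelines above: every P1, plus any incident with data loss,
    significant user impact, or a team request to dig deeper."""
    if severity == "P1":  # organizational mandate: postmortem for every P1
        return True
    return data_loss or significant_user_impact or team_wants_review

# Example: a P2 with no data loss but notable customer impact.
print(postmortem_required("P2", data_loss=False,
                          significant_user_impact=True,
                          team_wants_review=False))  # True
```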
I’d recommend performing your cold postmortem as a group within five to seven business days, while memories are still relatively fresh.
Postmortem templates
As you can already tell, a lot of information goes into a postmortem under various sections. This can be daunting at first, but thankfully, you don’t have to build them from scratch! A very effective way to create a postmortem is to use a well-established template.
There are many well-established postmortem templates available online; my favorite is Google’s, but any of them can give you a solid starting point.
Conclusion: Blameless postmortems are worth the investment
Blameless postmortems are essential for effective incident management. They help teams learn from incidents and prevent future occurrences. For postmortems to be effective, they must focus on systems and processes, not individuals. Senior leadership support and a central postmortem repository further enhance their value, enabling continuous improvement and innovation.
Further learning
Did you find this article helpful? You might enjoy reading these other articles by Karun Subramanian:
- Uptime vs. Availability: How to measure and improve reliability.
- SRE: How making systems observable improves their reliability
Alternatively, you might enjoy trying out Pluralsight's learning path, "Fundamentals of Site Reliability Engineering (SRE)." As the name suggests, it teaches you all about SRE, including how to incorporate it into your system design, manage teams for SRE, and implement SRE best practices. Why not check it out?