
How to conduct blameless postmortems after an incident

Postmortems help pin down what went wrong and how to stop it from happening again. However, blame needs to be kept out of the process to foster growth instead of fear.

Aug 13, 2024 • 14 Minute Read


When an incident occurs, everyone wants to know where the failure happened. It’s just human nature—during the 2024 CrowdStrike incident, we were all glued to the news, searching for answers as to why our Windows machines had been struck with a BSOD. However, while postmortems are a great tool, they can quickly become a blame game, with the focus on who made the mistake instead of how it was made.

This sort of postmortem is poison for your team culture. When employees know they’ll be pilloried if they make a mistake, they become risk-averse. Honesty, learning, and accountability go out the window as people seek to avoid blame, because accepting responsibility for a mistake will be punished. That’s why a good postmortem must always be blameless. 

In this article, I’ll cover how to go about conducting a blameless postmortem, the essential components it should include, and useful templates you can use.

What is a blameless postmortem?

A blameless postmortem is a structured process where teams analyze a past incident to document the root cause, business impact, timeline, lessons learned, and action items. The primary goal is to learn from the incident and prevent future occurrences. Blameless postmortems emphasize process improvements over individual fault, encouraging a healthy, open culture where engineers feel safe admitting mistakes.

Blameless postmortems are vital to incident response

Postmortems are useful in any field that practices incident management—the process of quickly resolving issues to limit business impact—and they almost always benefit from being blameless. The practice is most closely associated with Site Reliability Engineering (SRE), but fields such as cybersecurity and cloud computing benefit from blameless postmortems too, as do non-tech fields like emergency response, manufacturing, and retail.

Typically, if you’re providing a service that needs to operate within agreed-upon Service Level Objectives (SLOs), you need that service to be reliable. Blameless postmortems are an indispensable tool to help you achieve that.

Other reasons to conduct blameless postmortems

Let’s not sugar-coat it—conducting a postmortem of any kind is rarely enjoyable. However, several benefits make it worthwhile:

  • Comprehensive Incident Understanding: Postmortems offer a detailed account of the incident, including duration, user impact, financial impact, root cause, and preventive actions.
  • Root Cause Analysis: Using methods like the Five Whys, postmortems can uncover underlying issues within systems.
  • Learning Opportunities: Postmortems help teams reflect on what could have been done differently, promoting continuous improvement.
  • Preventive Actions: They identify gaps in monitoring and other preventive measures.
  • Accountability: Clear action items with owners and deadlines ensure follow-through on improvements.

Components of a successful postmortem

We’ve covered why you’d want to conduct a postmortem. But what about the how? Postmortems must contain the following high-level sections at a minimum:

1. Executive Summary

The executive summary is a brief paragraph or two summarizing the issue. Avoid going into too much technical detail; just sum up the particulars so readers can quickly grasp what happened, who was affected, and how it was resolved. Below is an example of what you might include if you were a Site Reliability Engineer:


Due to a failed release on October 15th between 8:00 PM and 9:00 PM CDT, 20% of auth microservice instances failed to connect to the backend, causing auth service calls to fail. The failure of the auth service impacted the trading portal, leaving 200 users unable to log in to the trading system. The issue was resolved by updating the failed auth service instances at 11:15 PM CDT on October 15th. We have since enhanced our monitoring to catch this class of issue during releases.


2. Business Impact

The whole reason you’re doing a postmortem is that an incident disrupted your ideal state of operations. Quantify the impact with actual figures, and if there was a financial impact, make sure to mention it. Continuing with our SRE example, you might write something like this:


10% of our 2,000 trading users (200) could not log in between 8:00 PM and 11:15 PM CDT on October 15th. These users could still place orders using our manual trade processing system. There was no financial impact, and we remained within our SLO (Service Level Objective) for the 4-week cycle.
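To make the SLO claim in the example concrete, here's a minimal sketch of the arithmetic in Python. The 99.9% availability target and the user-weighted way of counting downtime are assumptions for illustration, not figures from the incident:

```python
# Hypothetical error-budget check for the example incident above.
# Assumptions: a 99.9% availability SLO over a 4-week window, and downtime
# weighted by the fraction of users affected (10%).

SLO = 0.999                                 # assumed availability target
window_minutes = 28 * 24 * 60               # 4-week window = 40,320 minutes
error_budget = (1 - SLO) * window_minutes   # ~40.3 minutes of allowed downtime

outage_minutes = 195                        # 8:00 PM to 11:15 PM
affected_fraction = 0.10                    # 200 of 2,000 users
budget_consumed = outage_minutes * affected_fraction  # 19.5 user-weighted minutes

print(f"Error budget: {error_budget:.1f} min, consumed: {budget_consumed:.1f} min")
print("Within SLO" if budget_consumed <= error_budget else "SLO breached")
```

Under those assumptions, the incident consumed roughly half of the 4-week error budget, which is why the example can claim the SLO was not breached.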


3. Root cause

The smoking gun everyone wants to know about! What actually led to the incident? One classic way to get to the root cause is the “Five Whys” method: ask “why?” five times, each time directing the question at the answer to the previous “why.” The fifth “why” should reveal the root cause of the problem.

Let’s show this in practice:


10% of users saw "HTTP 500 internal server error" when logging in to the trading portal.

Why?

Upon analyzing the logs, the portal was erroring when loading the user profile.

Why?

The portal received an "unknown error" when calling the "auth" microservice.

Why?

Upon analyzing the auth microservice logs, we found that it was erroring out when connecting to the backend PostgreSQL database. This error only happened on a subset of auth microservice instances.

Why?

The auth microservice had an incorrect database connectivity configuration when connecting to the backend PostgreSQL database.

Why?

Last night, the auth service's code was released to update the database connectivity configuration, but the release failed to update some microservice instances, leaving those instances with outdated configurations.


So, ultimately, the root cause was a partially successful code release on a remote service (the auth service) on which your application depends.

Note: The Five Whys method is a powerful mechanism for determining the root cause. However, there may be situations where it doesn't apply neatly. For example, if a hard drive fails, you may only need two or three “whys” rather than five.
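As an illustration of the kind of release-time monitoring the executive summary alludes to, here's a minimal sketch that checks every auth instance for the expected configuration version after a rollout and flags stragglers. The instance URLs and the /config-version endpoint are hypothetical, not details from the incident above:

```python
import json
import urllib.request

# Hypothetical post-release check: confirm that every auth instance reports
# the expected configuration version. The endpoint and instance list are
# illustrative assumptions, not details from the incident above.

AUTH_INSTANCES = ["http://auth-1:8080", "http://auth-2:8080", "http://auth-3:8080"]
EXPECTED_CONFIG_VERSION = "2024-10-15-db-connectivity"

def stale_instances(instances, expected_version):
    """Return (url, reported_version) for instances that don't match."""
    stale = []
    for base_url in instances:
        try:
            with urllib.request.urlopen(f"{base_url}/config-version", timeout=5) as resp:
                version = json.load(resp).get("version")
        except (OSError, ValueError):
            version = None  # unreachable or malformed responses count as stale
        if version != expected_version:
            stale.append((base_url, version))
    return stale

if __name__ == "__main__":
    leftovers = stale_instances(AUTH_INSTANCES, EXPECTED_CONFIG_VERSION)
    if leftovers:
        print(f"Partial rollout detected: {leftovers}")  # page the release owner
    else:
        print("All instances report the expected configuration version.")
```

Running a check like this as the final step of the release pipeline would have surfaced the partially updated instances before users saw any errors.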

4. Timeline

Otherwise known as “a series of unfortunate events,” this section provides a chronological account of the incident. The timeline will help the team uncover opportunities for future process improvements.

For example, if you’re a Site Reliability Engineer, did you have to wait 25 minutes for the database team to join the war room? If so, you might ask if you can do specific tasks in parallel in the future. 
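Continuing the hypothetical SRE scenario, a timeline might look like the following (everything beyond the 8:00 PM start and 11:15 PM resolution mentioned earlier is made up for illustration):

  • 8:00 PM CDT: Release of the auth service configuration change begins.
  • 8:05 PM CDT: Monitoring alerts on elevated HTTP 500 errors from the trading portal.
  • 8:20 PM CDT: Incident declared; war room opened with the SRE and application teams.
  • 8:45 PM CDT: Database team paged; joins the war room at 9:10 PM.
  • 10:30 PM CDT: Stale database configuration identified on a subset of auth instances.
  • 11:15 PM CDT: Affected instances updated; users confirm they can log in again.

Reviewing a timeline like this, the team might ask why it took 25 minutes for the database team to join, or why the stale configuration took more than two hours to pinpoint.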

5. Lessons Learned

If you’re not learning any lessons, the incident is bound to recur. Possible subsections include the following:

  • What went well?
  • What did not go well?
  • What can we do to avoid this in the future?
  • Did our monitoring catch the issue before the end-users reported the issue? How can we improve?

6. Action Items

It’s all well and good to know the cause and be wiser for what happened, but you also need an action plan: concrete steps that will stop the incident from recurring. Every action item must have an owner and a target date, and the owner should be responsible for updating the postmortem when the task is complete.
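For the hypothetical incident above, the action items might look like this (owners and dates are illustrative):

  • Add a post-release check that verifies every auth instance picked up the new database configuration. Owner: Release Engineering. Target: October 29.
  • Alert when more than 1% of login requests fail for over five minutes. Owner: SRE team. Target: November 5.
  • Document the manual trade processing fallback for support staff. Owner: Trading Support lead. Target: November 12.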

Developing a blameless postmortem culture

So, we’ve talked about creating a great postmortem, but how do we make it “blameless?” This largely comes down to culture rather than process. You need to create an environment where your staff consistently write postmortems, while overcoming challenges such as time constraints and the natural reluctance of people to own up to mistakes. Here are some strategies you can use:

1. Senior leadership support and participation

Enforcing a postmortem culture must come from the top down. When leaders review their own mistakes too, they lead by example. Additionally, make sure postmortems are reviewed with senior leadership; when leaders are involved, teams are usually more inclined to write them.

2. Reward well-written postmortems

Conduct monthly or quarterly reviews of well-written postmortems and reward the teams who wrote them. This will encourage other teams to follow suit.

3. Develop a postmortem repository

All postmortems must be stored in a repository for easy access—GitHub can be a great option for this. As teams accumulate well-written postmortems, they can draw on past ones to troubleshoot new issues. Further, having a central repository of postmortems lends itself to even more intelligent solutions (such as using machine learning to aid with new problems).
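As a small illustration of how a central repository pays off, here's a sketch that searches a local clone of the repository for past postmortems mentioning a keyword, so an on-call engineer can quickly find incidents with similar symptoms. The one-Markdown-file-per-postmortem layout is an assumption, not a prescribed standard:

```python
from pathlib import Path

# Minimal sketch: search a local clone of the postmortem repository for past
# incidents mentioning a keyword. Assumes one Markdown file per postmortem
# under a postmortems/ directory.

def find_related_postmortems(repo_root: str, keyword: str) -> list[str]:
    """Return paths of postmortem files whose text mentions the keyword."""
    matches = []
    for path in Path(repo_root).glob("postmortems/**/*.md"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if keyword.lower() in text.lower():
            matches.append(str(path))
    return sorted(matches)

if __name__ == "__main__":
    for hit in find_related_postmortems(".", "HTTP 500"):
        print(hit)
```

Even a naive keyword search like this is useful at 2:00 AM; a richer index or a machine learning model can come later, once the repository has enough history to be worth mining.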

When should you conduct a postmortem?

Not all incidents should qualify for a postmortem. Create incident severity levels, such as Priority One (P1), Priority Two (P2), and Priority Three (P3), with P1 being the most critical for the business. In this scenario, commit to doing postmortems for every P1, and potentially for some P2s and P3s depending on business impact and team sentiment.

Here are my guidelines for determining whether you must do a postmortem (a small code sketch encoding them follows the list):

  1. Organizational mandate (e.g., you’ve committed to do one for all P1 incidents)
  2. The incident resulted in data loss (or another form of resource loss specific to your industry, such as monetary loss)
  3. There was significant user or customer impact
  4. The team feels they should dig deep to understand what happened
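As a tiny illustration, the four guidelines above collapse into a single check. The severity labels and argument names are assumptions for the sketch:

```python
# Hypothetical helper encoding the four guidelines above.
# Severity levels (P1/P2/P3) follow the scheme described earlier.

def postmortem_required(severity: str, data_loss: bool,
                        significant_user_impact: bool, team_requests_one: bool) -> bool:
    """Return True if any guideline says a postmortem is warranted."""
    mandated_by_policy = severity == "P1"   # organizational mandate for all P1s
    return (mandated_by_policy
            or data_loss
            or significant_user_impact
            or team_requests_one)

# Example: a P2 with no data loss but significant customer impact -> True
print(postmortem_required("P2", data_loss=False,
                          significant_user_impact=True, team_requests_one=False))
```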

I’d recommend performing the postmortem as a group within five to seven business days of the incident, while memories are still relatively fresh.

Postmortem templates

As you can already tell, a lot of information goes into a postmortem under various sections. This can be daunting at first, but thankfully, you don’t have to build them from scratch! A very effective way to create a postmortem is to use a well-established template. 

There are many postmortem templates available. My favorite is Google’s, published as an example in their Site Reliability Engineering book, but feel free to use any well-established template that covers the sections above.


Conclusion: Blameless postmortems are worth the investment

Blameless postmortems are essential for effective incident management. They help teams learn from incidents and prevent future occurrences. For postmortems to be effective, they must focus on systems and processes, not individuals. Senior leadership support and a central postmortem repository further enhance their value, enabling continuous improvement and innovation.


Further learning

Did you find this article helpful? You might enjoy reading other articles by Karun Subramanian, or you might try Pluralsight's learning path, "Fundamentals of Site Reliability Engineering (SRE)." As the name suggests, this video course teaches you all about SRE, including how to incorporate it into your system design, manage SRE teams, and implement SRE best practices. Why not check it out?

Karun Subramanian


Karun is passionate about IT operations. He has 20+ years of hands-on experience in diverse technologies, ranging from Linux administration to cloud technologies and everything in between. He specializes in modernizing IT operations with automation, end-to-end monitoring, CI/CD, and containerization. He has helped numerous companies implement DevOps, CI/CD, monitoring, log aggregation, and cloud migrations (AWS and Azure).

More about this author