We Are Responsible for Our Systems
We take responsibility for our systems.
Production issues and bugs are a common, albeit frustrating, part of software development. Consider two organizations with different approaches to this problem. Globomantics relies on a dedicated production issues and bugs team, which frees other teams to create new features. Carved Rock Fitness does not differentiate, and asks all teams to manage production issues for their particular product area as part of their daily work.
Whenever a critical issue comes up in the Globomantics platform, the support engineers immediately rally around it and work to resolve it quickly. During the rest of their time, they work on a lengthy backlog of less critical bugs. This backlog often has years’ worth of work in it, most of which will never be addressed. A significant amount of context switching takes place on this team as they jump from codebase to codebase addressing different issues. Morale on the support team is perpetually low. Engineers feel that they have little control over their workload and little ability to address stability concerns in a larger, holistic way. Meanwhile, the product teams continue to build new features for the platform. Morale on these teams is high since they are only working on the newest and most exciting features. Unfortunately, they are often unaware of the downstream issues their code creates for the support team.
Carved Rock Fitness focuses on a team’s ability to own its product throughout its entire lifecycle. Engineering leaders see several benefits from not splitting responsibility at the point of release. First, morale remains high across engineers because on-call duty and bug handling are distributed evenly. Second, there is a quick feedback loop between production issues and new functionality, so repeat issues are avoided. Third, engineers don’t have to become reacquainted with the codebase because they’re already working in it every day. Overall, customer satisfaction and uptime remain high.
Pluralsight is strongly aligned with the model that Carved Rock Fitness uses. We believe that teams are responsible for their entire product, not just the good (new features) but also the bad (bugs). Furthermore, they are accountable for continuously maintaining and improving their product. Teams act on this strategy in a number of ways. First, on-call is a shared responsibility among all of the team’s engineers. Second, issues are prioritized and addressed as they are reported. They are weighed against new feature work and triaged appropriately. There is no never-ending backlog of issues that will never be tackled. Third, teams learn from the bugs that they fix and find ways to incorporate those learnings into future iterations of their product. Finally, they are more motivated to write maintainable and tested code, since they’ll ultimately have to deal with the negative consequences of poor code.
There are a few things to look out for with this approach, however. First, teams are responsible not only for their own areas of the product but for the entire system. If your area has high uptime because you’ve offloaded all of the risky work to another team, you haven’t accomplished anything. Creating extra work for other teams isn’t a viable path to improved stability. Next, it may be tempting to offload all bug work and production support to the newest or most junior member of the team. This is a more localized version of Globomantics’ approach and comes with similar challenges. Burnout among new employees and communication challenges are more likely with this approach. On-call and production support must be rotated among all of the team members. Finally, it’s important to have a good triaging and prioritization process for production bugs that are reported. It’s easy to simply say that all in-progress work stops for any production issue. If your product’s stability is so high that bugs happen rarely, this attitude might be fine. However, for teams that have a steadier stream of issues, it’s valuable to weigh the issue’s impact against the current work. Issues that impact a small range of users, have a functioning workaround, or are cosmetic in nature might be lower priority than current feature work.
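To make the triage idea concrete, here is a minimal sketch in Python of how a team might encode the three signals mentioned above (audience size, workaround, cosmetic nature). Everything in it is a hypothetical illustration rather than a description of any real process: the `Issue` fields, the `should_interrupt_feature_work` function, and especially the 5% `small_audience` threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A reported production issue (hypothetical shape, for illustration)."""
    title: str
    affected_user_fraction: float  # share of users affected, 0.0 to 1.0
    has_workaround: bool           # users can still get their task done
    cosmetic: bool                 # visual-only problem, no loss of function


def should_interrupt_feature_work(issue: Issue, small_audience: float = 0.05) -> bool:
    """Return True if the issue should preempt in-progress feature work.

    Assumption: any one of the three mitigating signals is enough to queue
    the issue behind current work. A real team would tune this rule.
    """
    if issue.cosmetic:
        return False
    if issue.has_workaround:
        return False
    if issue.affected_user_fraction < small_audience:
        return False
    return True


if __name__ == "__main__":
    outage = Issue("Checkout fails", 0.40, has_workaround=False, cosmetic=False)
    icon = Issue("Misaligned icon", 0.40, has_workaround=False, cosmetic=True)
    print(should_interrupt_feature_work(outage))
    print(should_interrupt_feature_work(icon))
```

A rule like this is only a starting point for the conversation. Its value is in making the team’s priorities explicit, so that the decision to pause feature work is consistent rather than ad hoc.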
Handling production support and bugs on the same team that creates the features has a number of benefits. Reduced engineer burnout, increased uptime, and shorter feedback loops all contribute to a healthier product. Equitably sharing on-call and bug fixing responsibilities among engineers allows the team to be successful, morale to stay high, and production incidents to be resolved quickly.
Check out our Engineering at Pluralsight document to see all of the statements that shape our engineering culture.