Not too long ago I was visiting my brother-in-law and he was telling us about some of the more unusual incidents he’s had to respond to as a volunteer firefighter. Putting the colorful storytelling aside here, there was a common thread with all of it: ownership.

If there is an accident on the road, or a burning building, or pretty much any incident where emergency services are required, the firefighters own the scene until an all-clear is given. Later, if investigators are out recreating the scene, the police own the same scene, with firefighters blocking the road and providing any secondary support. In each situation, there is a clearly defined chain of command to make decisions and protect the first responders. Others can make requests (“Move the fire truck, my officers can’t get their cruisers in close.”), but the ultimate decision rests with the commanding individual (“No. They’ll get in the way of my crew extracting the victim from the wreckage. Now tell your officers to divert traffic and make space for the helicopter to land so we can airlift them out of here.”).
However, in the IT world, in an outage scenario, who owns the outage?
Yes, I know, there are quite a few ways to go about this, and if you’ve got good processes in place, you already know. What if you don’t?
As a Release Manager, part of my role was to own any potential incidents and outages and delegate any task as needed, including the ownership of the outage. Any incident that was discovered went to a key list of individuals so we could determine the best course of action, including no action. That list was me (the Release Manager), the QA Lead, the Engineering Managers, the Architect, and the VP of Engineering. Before this was in place, management, and even half the engineers, had no idea something was happening, and thus the outage would just drag on until someone finally did something about it. It was an unfortunate case of “someone else will fix it” more often than not. Adding ownership to this scenario helped ensure that not only was the issue recognized faster, but there was a single point of contact to direct inquiries and information through. Now, instead of disturbing the individuals actively working to fix the outage, management went to the Release Manager (or whomever I handed ownership to). The same thing occurred with the engineering and operations teams: they would pass information up to me, and I would be responsible for distributing the appropriate information to management. Through ownership of the issue, both management and the individual contributors were able to focus on what needed to be done to resolve the incident.
Another part of my role came after the incident was resolved. At this point, I had ownership of the analysis of the incident. Just as the police, when recreating an accident at the scene, can request the assistance of the firefighters, I had the authority to pull the necessary engineers out of their normal work to perform a root cause analysis (RCA) of the incident. While this RCA often occurred at the end of a Sprint in the discussions over what went wrong or could have gone better, key information could be forgotten if the incident had occurred at the beginning of the Sprint. Getting everyone involved in a room the day after an incident at the latest ensured that timelines and steps taken were still fresh in everyone’s mind.
Having an individual dedicated to incident ownership in a smaller organization may not be the best use of resources. In that scenario, I would strongly recommend (if you don’t have one already) that you have an incident response document that outlines the steps to be taken. Then, incident ownership can fall to any member of the team, including junior members, so that there is a single point of communication in and out of the engineering and operations teams (or single multi-function team, of course) for the stakeholders. The document would also help ensure that the RCA and any associated documentation are completed in a timely manner.
I’d be interested to hear how others handle their incident responses within their organizations.