When an incident strikes, the customer doesn’t care who solves the issue – they just want functional systems. So, organizations are constantly tasked with defining incident management processes and refining incident response plans. But, because every team and business is structured a little differently, a one-size-fits-all incident management process doesn’t make sense. Figuring out who owns which parts of the process and creating a blameless culture dedicated to continuous improvement and transparency will drastically improve collaboration and overall service reliability.
Software development and IT continue to change and grow – leading to new processes and team structures. DevOps practices and Agile software development processes are being integrated with ITIL incident management principles – helping teams respond to incidents in real-time across all systems. In this post, we’ll help you define who owns service reliability and go over some methods for building an incident response and management process that keeps customers happy, drives business value and makes on-call suck less for employees.
So, who owns service reliability? The short answer – everyone. Both IT operations and software engineers are responsible for maintaining uptime and providing positive customer experiences. So, IT teams and developers both need to take accountability for the services they build and deploy. Instead of forcing operations support and IT to remediate all production issues, developers also take accountability for fixing issues that come up with services they’ve helped create.
By improving transparency and collaboration between IT, development and support throughout all of software delivery and incident response, everyone gets more exposure to their systems. So, more people are equipped with the knowledge to help fix issues or escalate problems to the right person at the right time. No single person or team should take accountability for the reliability of the system – it’s a team effort.
So, let’s take a look at the general incident management process and help you define where your team lands in the incident management maturity lifecycle.
The specific way DevOps and IT operations handle incident response and resolution will change depending on the team. But, every on-call team needs to prepare for what they’d do in case of an emergency. Whether your team is fully DevOps-centric or still adhering to ITIL V3 practices, the incident management process always follows five steps: detection, response, remediation, analysis and preparation.
Of course, you need to notice an incident before you can take action to remedy the issue. So, incident detection is logically the first step of any incident management process. Effective monitoring and alerting can help you get started – but understanding how to attribute alerts to one another and identify the core issue of an incident can change the game. Constant tweaking of monitoring tools and refinement of visibility into node and system health will help teams identify incidents faster. Finding ways to detect incidents sooner and serve the right information to the right people can help you start taking action faster.
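To make the idea of attributing alerts to one another a little more concrete, here is a minimal sketch in Python. It assumes a deliberately simple heuristic that is not from any particular monitoring tool: alerts from the same service that fire within a few minutes of each other probably belong to the same incident. The `Alert` fields, the `group_alerts` function and the five-minute window are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str        # hypothetical field: the service that raised the alert
    message: str
    fired_at: datetime


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts per service when each fires within `window` of the previous one."""
    groups = []
    open_group = {}  # service -> the group currently collecting that service's alerts
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        group = open_group.get(alert.service)
        if group and alert.fired_at - group[-1].fired_at <= window:
            group.append(alert)  # close enough in time: treat as the same incident
        else:
            group = [alert]      # otherwise start a new candidate incident
            groups.append(group)
            open_group[alert.service] = group
    return groups


t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    Alert("checkout", "latency high", t0),
    Alert("checkout", "error rate high", t0 + timedelta(minutes=2)),
    Alert("search", "disk almost full", t0 + timedelta(minutes=3)),
    Alert("checkout", "latency high", t0 + timedelta(minutes=30)),
]
incidents = group_alerts(alerts)
for group in incidents:
    print(group[0].service, [a.message for a in group])
```

Real systems usually correlate on more than service name and time (topology, deploy events, shared dependencies), so treat this as a starting point for thinking about noise reduction rather than a recommended algorithm.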
Once an incident has been detected, the team needs to mobilize quickly and respond to the issue. A highly collaborative team with organized on-call schedules and rotations alongside automated alert routing and escalations will be the best at incident response. Any possible way to provide more alert context while improving workflow transparency and collaboration will lead to better incident response. Does everyone on the team know who’s on-call at any given moment? Can you quickly determine the source and cause of an alert when you get a notification while on-call? Constant improvement of the way humans interact with technical systems and processes will continuously drive down MTTA and MTTR and help teams maintain more reliable services.
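If you want to track whether MTTA and MTTR are actually going down, the arithmetic is simple. The sketch below assumes each incident record carries three timestamps (`detected_at`, `acknowledged_at`, `resolved_at`), which are hypothetical field names, and it measures MTTR from detection to resolution. Teams define MTTR in different ways, so adjust the start and end points to match your own definition.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are assumptions for this example.
incident_log = [
    {
        "detected_at": datetime(2024, 1, 1, 12, 0),
        "acknowledged_at": datetime(2024, 1, 1, 12, 4),
        "resolved_at": datetime(2024, 1, 1, 12, 40),
    },
    {
        "detected_at": datetime(2024, 1, 2, 15, 0),
        "acknowledged_at": datetime(2024, 1, 2, 15, 10),
        "resolved_at": datetime(2024, 1, 2, 16, 20),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed time in minutes between two timestamps across all records."""
    return mean(
        (record[end_key] - record[start_key]).total_seconds() / 60
        for record in records
    )


mtta = mean_minutes(incident_log, "detected_at", "acknowledged_at")
mttr = mean_minutes(incident_log, "detected_at", "resolved_at")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

With the two sample incidents above, acknowledgment took 4 and 10 minutes and resolution took 40 and 80 minutes, so the averages come out to 7 and 60 minutes.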
Armed with the right information and a streamlined process for incident response, implementing the actual fix is a lot easier. It’s easy for incidents to get lost in the shuffle of multiple alerts and escalations. So, effective remediation depends greatly on having your incident response plan down pat. Then, give your team the DevOps tools they need to execute commands, pull in additional responders and restore services faster. Through a combination of automation and improved human engagement, incidents are remediated faster, which allows developers and operations teams to spend more time building and deploying new features and products.
Once the issue is resolved, the DevOps or IT team still needs to conduct a post-incident review. Thorough post-incident reviews will expose areas for improvement, weaknesses in your response strategy and blind spots in your monitoring and alerting stack. By endorsing a blameless culture and collecting incident information on a regular basis, every incident becomes a learning opportunity. Incident analysis shows the team exactly what went wrong, how the on-call teams responded and how they can avoid similar incidents in the future.
Now, armed with post-incident knowledge, the team can prepare for the next time an incident strikes. Update runbooks and playbooks, define new alert routing rules or make adjustments to escalation policies. How can you get alerts to the right person faster? What additional information could have been appended to an alert? What else could have been helpful for rapid incident response and resolution? In a modern era of CI/CD and highly-complex technical architecture, incident preparation is the only surefire way to maintain a consistent release pipeline without negatively impacting customers.
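Alert routing rules and escalation policies are easier to review after an incident when they are written down as data. The following Python sketch shows one way to picture that. The rule fields, team names, severity scale and 15-minute escalation step are all invented for this example and do not reflect any specific product’s configuration format.

```python
from dataclasses import dataclass


@dataclass
class RoutingRule:
    service: str        # service the rule applies to
    min_severity: int   # lowest severity that triggers this rule
    team: str           # on-call team that receives the alert


# Hypothetical rules and escalation chains, for illustration only.
RULES = [
    RoutingRule("payments", 2, "payments-oncall"),
    RoutingRule("search", 3, "search-oncall"),
]

ESCALATION = {
    "payments-oncall": ["primary", "secondary", "engineering-manager"],
    "search-oncall": ["primary", "secondary"],
    "ops-oncall": ["primary", "ops-lead"],
}


def route(alert, rules=RULES, default_team="ops-oncall"):
    """Return the first team whose rule matches the alert, else the default team."""
    for rule in rules:
        if alert["service"] == rule.service and alert["severity"] >= rule.min_severity:
            return rule.team
    return default_team


def who_to_page(team, minutes_unacknowledged, step_minutes=15):
    """Walk one step up the escalation chain for every `step_minutes` without an ack."""
    chain = ESCALATION[team]
    step = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[step]


alert = {"service": "payments", "severity": 3}
team = route(alert)
print(team, who_to_page(team, 0), who_to_page(team, 20))
```

Keeping rules in a reviewable form like this makes the post-incident questions above practical: if an alert reached the wrong person, you can point at the rule that sent it there and change it.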
The way your team approaches the incident management process will depend on how mature your IT and engineering organization is. The earlier you can implement a holistic, proactive incident management and response process, the easier it will be to scale on-call operations and continuously deliver new software. Smaller applications and services are typically easier to wrangle and therefore allow you to be more reactive. But, as your architecture grows, you’ll begin to see more alerts and production incidents. So, it’s important to move to a more holistic incident management model right now – because it only gets harder later.
So, let’s dive into the different levels of the incident management maturity lifecycle to help you understand where your own team lands today:
Very little visibility or awareness of system performance or overall health
A disjointed approach to communicating during an outage (email, SMS, incident management mobile app, Slack, etc.)
Undefined or confusing on-call rotations, personnel roles and processes across the entire incident management lifecycle
Unorganized monitoring and alerting strategies that are disconnected from collaboration tools and on-call schedules, notifications, escalations and remediation qualifications
A general incident response framework, processes and tooling for monitoring, alerting and communication
Some segmentation in on-call teams, personnel roles, alert routing rules and incident prioritization
Defined methods for communicating during a firefight and a basic method for recording human interaction during an incident
Roughly defined policies and procedures for incident management across most teams with improved visibility into the entire lifecycle
The team starts conducting deeper post-incident analysis – learning from post-incident reviews and cross-department knowledge sharing
Triage documentation and runbooks are provided to on-call responders
Alerts come with context – surfacing applicable incident metrics, traces and logs in real-time
Alerts are rarely dropped or missed due to effective alert routing and on-call coverage
The team consistently improves visibility and collaborates cross-functionally in a standardized way throughout the entire incident lifecycle
Many systems self-heal, incidents self-resolve and alert fatigue is reduced through highly intelligent, automated alert routing processes
The team is monitoring and reporting on advanced metrics, thresholds and alerts
Not only are incidents detected and responded to faster, but on-call quality of life is improved for employees
Defined methods for collaboration and communication in-line with context and automation – leading to rapid incident response and greater workflow visibility – all the things that make on-call suck less
A continuous improvement strategy is in place for the entire incident management process – complete with post-incident reviews, incident management KPIs and metrics and a better understanding of the team’s incident collaboration
As software engineering and adoption of technology grew, ITIL was implemented as a set of principles dictating effective IT service management (ITSM). But, as CI/CD practices and cloud adoption grew, IT operations practices needed to change too. So, DevOps became a methodology for improving collaboration and visibility across IT and engineering throughout the entire software development lifecycle. And, as teams continuously improve the speed and reliability of service delivery, they’re being tasked with improving the way they respond to incidents in production.
Developers aren’t simply handing code to IT operations and letting them deploy, maintain and fix any production issues. DevOps is allowing everyone to take accountability for the services they build and maintain – giving developers on-call responsibilities and ownership of their own applications and infrastructure. As time goes on, the basics of ITIL practices still persist but are beginning to bleed into DevOps ideals focused on collaboration and transparency.
The incident management process will look different for every team. But, developers and operations teams who understand the incident management lifecycle can use their knowledge to constantly improve the way they respond to incidents, maintain positive customer experiences and reduce the costs of downtime. Before we close out the article, here are three easy tips for improving on-call incident management and response:
The more you can test throughout the software delivery lifecycle, the less likely it is for a production incident to occur. Combine testing with game days, and the team is constantly exercising its technical systems while practicing its incident response processes. Developers and IT professionals learn more about their systems, and they also learn how to work together. Then, when an incident does come up, the team has the technical expertise and the preparation they need to rapidly remediate issues – often before customers even notice a thing.
With post-incident reviews and a strategy for continuous improvement, the team will constantly get better at on-call incident management. Constant tweaks to alert rules, updates to runbooks and changes to monitoring tools and thresholds are required as your system changes. In order to continuously deploy and integrate, you need to continuously adjust incident management procedures. More than anything, continuous improvement starts with an organizational culture shift leading to blameless conversations and constant refinement of operations.
A central source of truth for on-call, alerting and communication is a requirement for an efficient incident management process. By tracking all incident information, routing alerts to the right person at the right time and communicating in one place, the team can easily determine what’s working and what’s not with their incident management process. Working across disparate tools, systems and teams creates a lack of transparency and hinders the continuous improvement process.
So, who owns the incident management process? Well, everyone does. From product manager to release manager, from customer support to front-end developers, the team is responsible for maintaining reliable applications and infrastructure. Better collaboration, transparency and automation throughout software delivery and incident response will help you continuously improve your incident management process – making customers and employees happy while driving business value faster.