Dan Holloran | October 07, 2019 | Monitoring & Alerting, DevOps, Collaboration, Post-Incident Review, On-Call
DevOps and IT operations teams rely on visibility across disparate applications and infrastructure in order to know when a complete service is healthy and when there’s a problem. And, with improved visibility, the team needs processes for collaborating effectively during real-time firefights and when conducting post-incident reviews. Incident investigation, both during an incident and after it, leads to better on-call schedules, incident response workflows and more resilient services.
But, if incident investigations aren’t taken seriously or are outright ignored, they can be a hindrance. It’s important that each step you take during incident investigation is a useful one – especially in the middle of a firefight. Better incident investigations lead to faster incident response and an improved process over time – reducing MTTA and MTTR while mitigating on-call burnout.
So, we thought it would be worthwhile to cover 5 steps for improving incident investigations in DevOps and IT.
In DevOps and IT, incident investigations occur at two points in the incident management lifecycle – during the incident and afterward. Afterward, you have more time to conduct post-incident reviews and dive into what worked and what didn’t. Real-time firefighting, by contrast, is more about surfacing resources and tools when on-call engineers need them. Real-time incident investigations and the way on-call teams collaborate will feed into the way your team holds post-incident reviews.
A centralized solution for incident collaboration and alerting can help you maintain more consistent documentation and visibility into the incident management lifecycle – helping teams focus less on managing tickets and more on resolving problems. Then, you have all the information right at your fingertips for conducting thorough post-incident reviews. As you conduct post-incident reviews, real-time incident investigations begin to get easier, and vice versa.
Now, let’s dive into the 5 actionable techniques that any DevOps or IT team can take to reduce downtime, build more resilient applications and infrastructure and fix issues faster.
Engineering and IT organizations are constantly shifting and adjusting the way they work together in order to deliver features faster without harming overall system reliability. So, teams are shifting to a DevOps mindset and finding new ways to investigate incidents and learn from their mistakes. With these 5 steps, every DevOps and IT team can improve incident investigations – both during the firefight and after the fact.
Many DevOps and IT operations teams focus too much on monitoring and alerting within their technical systems and not enough on the people and process behind their services. Visibility into system health and developing observable applications and infrastructure is important but it’s only half of the equation. What happens once you have the right information at your disposal? What steps do you take once you know there’s a problem with one of your services?
Developers and sysadmins tend to focus on the technical performance of applications, networks and servers. But, incidents are inevitable in an era where CI/CD, microservices and cloud-based architecture are commonplace. So, the best way to ensure service resilience is by mobilizing on-call teams quickly and arming incident responders with the context and data they need to quickly remediate issues. How can you improve cross-functional communication between everyone – database admins, sysadmins, frontend developers, backend engineers, etc.?
Better incident investigation comes from learning how to best route alerts and quickly surface context to the people who need it. Incident automation and contextual, collaborative alerting tools like VictorOps can create a single-pane-of-glass view into service health, on-call schedules and team workflows. But, every team is built differently. So, you really need to focus on the types of processes that would work best for your team and start building out an on-call incident response framework that works for nearly any incident, large or small.
While a single incident can have one immediate technical cause, such as an API failing, DevOps and IT teams need to understand that the problem goes much deeper than that. There’s never one single cause of a failure or an error. A number of decisions and actions can lead up to a major incident – meaning the problem could often have been prevented before it occurred.
If you review an incident by looking only at the technical reasons for failure, you miss the greater problem within your development and incident management process. Was there a reason QA didn’t notice the problem? Are your staging and production environments different enough to have caused the issue when new code was pushed to prod? Then, even once the issue made it into production, what could’ve been done to detect it faster and find a resolution? Was there a gap in communication or visibility between operations and development at any point?
You’re missing so many important details by focusing only on the technical root cause of one single incident. This approach can help you fix the singular issue but doesn’t better prepare you for other incidents across the entire system. Holistic incident investigation means acknowledging there’s more than one root cause to any incident and then working to resolve all of the problems.
When firefighting happens across disparate communication channels (e.g. Slack, email, SMS, and phone conversations), there’s a lack of visibility across the entire incident’s lifecycle. A lack of transparency around release management and production environments will hinder collaboration between developers and IT operations. So, in order to keep everyone informed and maintain uptime, you need detailed documentation across everything from release management to incident management.
This works best when you can manage the CI/CD pipeline and on-call incident management process in one single tool. IT operations teams can see what developers are pushing through the development pipeline and developers can see what’s happening in production environments. Alongside integrated chat and collaboration functionality, this improves visibility and exposure across the entire system, leading to a more progressive DevOps-minded organization and more reliable services.
By using one centralized tool for collaboration and transparency in a DevOps environment, you can automatically keep detailed documentation of everything that’s happening. Then, instead of managing tickets in the middle of a firefight, you can focus on fixing the problem in real-time. Afterward, you can use the documentation to track everything that happened during incident response and compile the information into informative post-incident reviews.
As with nearly any other DevOps task, incident investigations can be improved with automation at all the right parts of the incident lifecycle. On-call schedules can be integrated with an alert rules engine to automatically notify responders when monitoring metrics surpass certain thresholds. Automation can also route notifications through complicated organizational structures and on-call rotations to get alerts to the right people at the right times. And, maybe even more importantly, it can silence or delay alerts when they’re not an immediate priority.
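To make the idea concrete, here is a minimal sketch of a threshold-based alert rule that looks up the on-call responder and delays non-urgent alerts outside business hours. It is an illustration only, not how VictorOps or any specific tool works: the metric names, the `ON_CALL` roster, the 09:00 to 17:00 window and the `route_alert` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class Rule:
    metric: str          # metric name the rule watches (hypothetical)
    threshold: float     # the rule fires when the value exceeds this
    team: str            # team whose on-call responder gets the alert
    urgent: bool = True  # non-urgent alerts are delayed outside business hours


# Hypothetical on-call roster: team -> responder currently on call.
ON_CALL = {"database": "alice", "frontend": "bob"}

# Hypothetical rules; a real rules engine would load these from configuration.
RULES = [
    Rule("db_replication_lag_s", 30.0, "database"),
    Rule("page_load_ms", 2000.0, "frontend", urgent=False),
]


def route_alert(metric: str, value: float, now: datetime) -> Optional[dict]:
    """Return who to notify and how, or None if no rule is triggered."""
    for rule in RULES:
        if rule.metric == metric and value > rule.threshold:
            in_business_hours = time(9) <= now.time() < time(17)
            # Page right away for urgent alerts; hold the rest until daytime.
            action = "page" if rule.urgent or in_business_hours else "delay"
            return {"responder": ON_CALL[rule.team], "action": action}
    return None
```

With these assumed rules, a replication lag of 45 seconds at 3 a.m. pages the database responder, while a slow page load at the same hour is delayed rather than waking anyone up.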
Automation can also be used to serve up runbooks when on-call responders need them. Your incident management tool can automatically attach runbooks to recurring or similar incidents, providing on-call responders with helpful instructions as they need them. Runbook automation gives DevOps and IT teams actionable steps they can take toward incident remediation at the exact moment they need them. Normal alerting processes will simply tell an on-call team that something’s wrong but can’t help them understand what they should do.
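As a rough sketch of what attaching a runbook to a recurring incident could look like, the snippet below matches keywords in an alert title against a small runbook index. The `RUNBOOKS` mapping, the file paths and the `attach_runbook` helper are hypothetical names invented for this example; a real incident management tool will have its own matching rules.

```python
# Hypothetical runbook index: keyword in an alert title -> runbook location.
RUNBOOKS = {
    "disk": "runbooks/disk-full.md",
    "replication": "runbooks/db-replication-lag.md",
}


def attach_runbook(alert: dict) -> dict:
    """Return a copy of the alert annotated with the first matching runbook.

    The "runbook" field is None when no keyword matches, so responders can
    tell at a glance that no instructions exist yet for this kind of alert.
    """
    title = alert["title"].lower()
    match = next((path for key, path in RUNBOOKS.items() if key in title), None)
    return {**alert, "runbook": match}
```

Keyword matching is the simplest possible choice here. The point is only that the responder receives the alert and the instructions together, instead of an alert that says something is wrong and nothing more.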
As teams scale and new people join the team, there will be a certain level of unfamiliarity with legacy applications and infrastructure. The only way to combat a lack of historical knowledge and visibility is by adding automation to the incident lifecycle. With automation in runbooks and alerting, you’ll speed up incident investigations and serve applicable context and instructions to people when they need them.
Any way to improve the cross-functional collaboration between developers and operations will benefit you. Software engineers and IT professionals shouldn’t work in isolation when trying to resolve production incidents. Developers can’t expect to simply throw their code over the wall and let IT deploy it to production. There needs to be a collaborative process for escalations and real-time firefighting that allows both teams to enter the fray at critical moments.
Then, once an issue is resolved, the burden of sharing ideas and learning from failure shouldn’t rest on just one team or person. Post-incident reviews need to loop in everyone affected so you can understand the full impact of an incident – from customer support to engineering. If cross-functional teams do all of their firefighting in a centralized tool and can easily escalate and reroute alerts to different teams and escalation paths, you become better at collaborating in real-time. With the right input from the right people, supported by the right data, incident investigation is easier and MTTA and MTTR continue to drop over time.
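If you want to check whether MTTA (mean time to acknowledge) and MTTR (mean time to resolve) are actually dropping, you can compute both from incident timestamps. The sketch below assumes each incident record carries `triggered`, `acknowledged` and `resolved` times and measures both metrics from the trigger time. The field names and the sample incidents are made up for illustration, and teams define MTTR in slightly different ways.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Made-up sample data, for illustration only.
incidents = [
    {
        "triggered": datetime(2019, 10, 1, 2, 0),
        "acknowledged": datetime(2019, 10, 1, 2, 4),
        "resolved": datetime(2019, 10, 1, 2, 40),
    },
    {
        "triggered": datetime(2019, 10, 3, 14, 0),
        "acknowledged": datetime(2019, 10, 3, 14, 2),
        "resolved": datetime(2019, 10, 3, 14, 20),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")  # (4 + 2) / 2 = 3.0
mttr = mean_minutes(incidents, "triggered", "resolved")      # (40 + 20) / 2 = 30.0
```

Tracking these two numbers per month or per quarter gives post-incident reviews a simple trend to discuss alongside the qualitative findings.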
The only surefire way to improve incident investigations is by breaking down silos and improving transparency and collaboration across all of the software development and incident management processes. The 5 steps for improving incident investigations will drive more resilient services and lead to a more collaborative DevOps-centric organization.
When developers and IT professionals share visibility into each other’s workflows, they can easily interject and be looped into cross-functional conversations. DevOps-minded teams with a plan for real-time incident investigations and collaborative post-incident reviews will be faster than their counterparts when failure strikes.
See how developers and IT professionals are centralizing important deployment and incident data in a single, collaborative tool with VictorOps. Sign up for a 14-day free trial or request a personalized demo to streamline incident management workflows and make on-call suck less.