Critical application errors and infrastructure incidents are bound to happen. Highly interconnected systems, microservice architectures and containers mean developers, sysadmins, technical support and IT security analysts can’t simply work in silos. A simple alert regarding client-side latency may affect more than just frontend development teams. How do large teams mobilize users across multiple engineering and IT disciplines to fix the root cause of a problem affecting multiple parts of their architecture?
Not only do teams need to know how to fix these types of interconnected problems, but they also need to know when there’s an issue in the first place. When a major incident hits your system, multiple alerts will likely fire and notify multiple users – some of whom rarely interact with each other in their normal day-to-day routine. So, how do these incident responders come together to quickly identify the problem and find the solution?
That’s where the new VictorOps war room pop-out view comes into play. Major incidents require major collaboration and transparency. Say goodbye to receiving on-call notifications and working on incidents in a dark room, all alone, in the middle of the night with nowhere to turn. The war room includes actionable incident context and remediation tools alongside on-call schedules and alert automation to truly make on-call suck less while helping you reduce the customer impact of downtime and performance errors.
Before we go any further, let’s take a look at what constitutes a “major incident.” And, as you’ll quickly see, this might not look the same for everyone.
A major incident typically refers to an incident with a wide enough scope to affect multiple stakeholders. These affected stakeholders can be customers, internal engineering and IT teams, or people on the business teams. In the context of applications, infrastructure and the overall process of alert management, response and remediation, a major incident often refers to internal software engineering and IT operations efforts. If multiple stakeholders from different parts of the engineering organization need to come together in order to fix a problem, it’s typically a major incident.
But, if you only think of major incidents as affecting engineering teams, you’re not looking at the whole equation. If you need to maintain a highly available application for customers, your team’s classification of a major incident may refer more to the impact on users and customers. Oftentimes, a customer-impacting incident will also be a major incident in terms of how it affects your own team. If a critical function breaks, customers can’t use it and your sales team can’t demo it – potentially leading to lost revenue.
The ability to bring together stakeholders at any given time, in real-time, without sacrificing documentation or speed shouldn’t be overlooked by any IT or DevOps-minded department. VictorOps, especially with this refined war room interface, allows you to automate the on-call notifications piece of incident management while improving the way people can collaborate and respond to problems quickly.
The new war room display gives users a way to communicate, take action on incidents, and automatically update or look at related Slack channels or ServiceNow tickets – helping individual responders share applicable context with related stakeholders in the places they’re already working. In one single place, on-call responders can find instructions and tools such as runbooks, wikis, dashboards or post-incident review documentation to diagnose exactly how they can fix a problem. In that same place, you can reroute alerts or add suggested responders to an incident and communicate through your preferred conference call software and/or chat tool to effectively respond to a major problem.
A major problem with major incident response is the inability to know when something is wrong. If you receive one alert but your monitoring isn’t set up to alert on numerous dependencies or connected applications or services, you might only see the tip of the iceberg for an incident. So, it’s even possible for major incidents to not look like major incidents at first glance.
Over time, you can use VictorOps’ war room pop-out view (depicted above) to learn about your system and how your team interacts with the system. Robust reporting and thorough post-incident reviews built into the tool will lead to actionable insights and help you facilitate better monitoring and alerting practices – leading to a better approach to major incident response. Then, with a better understanding of the underlying system’s architecture and applications, the team can make better use of a collaborative war room to share context, break down silos and fix issues faster.
In order to get to the new VictorOps War Room pop-out incident view, you can click into an incident in VictorOps and select the button in the upper right corner (shown below).
Every incident is a little bit different – especially as you continue to build out applications and services. Your known unknowns and unknown unknowns continue to change and drive further complexity for IT and DevOps organizations. So, it’s hard to create a single process for incident response and incident management. But, the faster you can learn from previous incidents and implement runbooks and other remediation and orchestration workflows (ideally through automation), the more you can focus on new problems, not old ones.
As you ingest more incidents into VictorOps and resolve problems, you continuously improve your process. People gain more exposure to the entire system – developers are exposed more to production environments while IT operations and QA teams get more access to testing and staging environments – leading to more reliable applications and infrastructure. The war room helps you take care of major incidents quickly while simultaneously reducing the overall burnout of your on-call teams.
The saying, “What doesn’t kill you makes you stronger” makes a whole lot of sense in the context of major incident response. Cross-functional collaboration during a firefight is crucial for restoring service uptime and availability, but it also leads to a more resilient incident management process in the future.
The war room UI is intuitive and gives teams a sort of single-pane-of-glass, digital NOC. Distributed teams who work remotely can use the war room as a way to collaborate, share information and execute remediation processes from the comfort of their home. Data from your monitoring, communication and IT ticketing tools can flow in and out of VictorOps to help you find and collect all of your insights – from APM to NPM to IT ticket management – and so much more.
Without skipping a beat, anyone can be notified of an issue from anywhere, with all of the necessary alert context right at their fingertips. Automation allows for fewer dropped alerts due to missed on-call coverage as well as more detailed alerts to help inform on-call responders faster. Remote teams take advantage of the VictorOps war room to bring in multiple users across the company to help triage incidents quickly and collaborate in real-time – reducing mean time to acknowledge and resolve (MTTA/MTTR) for major incidents.
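To make MTTA and MTTR concrete, here is a minimal Python sketch of how a team might compute both from its own incident timestamps. The `Incident` record and its field names are hypothetical and do not reflect a VictorOps schema or API. The sketch also assumes time to resolve is measured from when the alert triggered; some teams measure it from acknowledgment instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative, not a VictorOps schema.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtta_minutes(incidents):
    """Mean time to acknowledge: trigger -> first acknowledgment."""
    return mean_minutes([i.acknowledged_at - i.triggered_at for i in incidents])


def mttr_minutes(incidents):
    """Mean time to resolve: trigger -> resolution (an assumed definition)."""
    return mean_minutes([i.resolved_at - i.triggered_at for i in incidents])


# Two made-up incidents for illustration only.
start = datetime(2020, 1, 1, 2, 0)
incidents = [
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=50)),
    Incident(start, start + timedelta(minutes=6), start + timedelta(minutes=70)),
]

print(mtta_minutes(incidents))  # 5.0
print(mttr_minutes(incidents))  # 60.0
```

Tracking these two numbers before and after a change in process is one simple way to check whether collaboration during major incidents is actually speeding up response.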
It’s hard to track the entire value of a collaborative, transparent approach to incident management. For a lot of on-call engineers, the value of this process lies in their ability to sleep easier at night. Engineering managers can feel comfortable with their incident response coverage and knowledge of what’s happening in their architecture while CIOs and CTOs can rest assured their costs of downtime are reduced. Major incident response is never easy, and rarely will things work out perfectly. But, hopefully, the new VictorOps war room view can help on-call engineers feel better about taking accountability for service reliability while simultaneously keeping customers happier. Stay tuned to hear more about rolling updates coming soon to the War Room – making on-call suck even less.
Try out the war room for yourself. Sign up for a 14-day free trial to see how cross-functional engineering and IT teams are mobilizing multiple responders, automating alerts and making firefights suck less with a collaborative, transparent approach to incident response.