Your Guide to Collaborative Incident Response
Incident: A problem, represented by an alert, that could negatively impact customers, your employees, and the stakeholders inside or outside of your organization.
In order to stay competitive in today’s market, businesses are expected to innovate quickly. Many engineering teams feel pressure to build, deploy, and operate services with increasing speed. High-performing teams innovate faster and maintain their sanity because they’re able to quickly recover from incidents.
As we move from agile development to rapid deployment, teams need to think beyond a reactive operations center. That’s why choosing the right on-call and incident response system is more than just the icing on the cake to a successful DevOps culture. Incident response is the cornerstone to engaging high-performing engineering and ops teams who champion uptime and own on-call instead of fearing it. Ultimately, rethinking and retooling your approach to DevOps and incident response is imperative to delivering products and applications that keep businesses relevant.
The purpose of this buyer’s guide is to discuss why progressive, high-performing teams choose to invest in high-performance incident response software. From the challenges across the SDLC to specific incident response product features, we’ll lay out everything you need to consider when choosing an incident response solution.
Commonly Faced Issues Without Established Incident Response
- Alert noise and fatigue
- Disorganized communication
- Poor alert flow from disparate IT systems
- Siloed communication
- Wrong people being alerted
- Unprepared for a crisis
- Disconnected workflows
- Repeating previous mistakes
Building a Culture of Urgency and Availability
High availability is essential to business success—an issue complicated by the increasing deployment demands of a highly competitive market. Accordingly, investing in processes to ensure near-zero downtime alongside rapid deployment is mission critical for the entire engineering and IT department.
Here, we break down how incident response is key to maintaining a culture of availability without slowing down the innovation process—and how DevOps is the essential piece for successfully executing this shift.
The Negative Economic Impact of Downtime
For the Fortune 1000, the average total cost of unplanned application downtime is $1.25B to $2.5B annually. The average cost of an infrastructure failure is $100,000 per hour. The average cost of a critical application failure is $500,000 to $1 million per hour.
These aren’t outliers limited to the enterprise. Outages (and their subsequent costs) affect companies large and small. Errors like these carry negative externalities, including damage to the brand and to overall customer trust. For example, in 2017, GitLab lost a massive amount of customer data after an error (and subsequent failures of multiple redundant backup protocols). Customer projects, comments, and other data were all gone. While source code repositories were safeguarded, the loss was problematic for a company whose business involves data stewardship.
In the VictorOps 2017 State of On-Call Report, we learned that 56% of respondents mentioned revenue impacts as the biggest negative result of downtime in their business. Of course, downtime affects more than just revenue; the repercussions of a major outage are felt throughout the business.
- $100,000/hour: Hourly cost of an infrastructure failure
- $1.25B to $2.5B: Total cost of unplanned application downtime
- $500,000 to $1 million: Average cost of a critical application failure
- 56%: Mentioned revenue impacts as the biggest negative result of downtime
Competitive Advantage of Minimal Downtime
More advanced companies use historical incident data to proactively prepare teams to resolve events faster, and to prevent those events in the first place. This in turn becomes a competitive advantage as highly functional “on-call” teams help prevent revenue loss, maintain brand reputation, and drive customer satisfaction.
Recent research demonstrates these high performers are deploying 46x more frequently, with a 440x faster lead time from commit to deploy, all while maintaining a mean time to recover (MTTR) that’s 96x faster. And change failure rate? It’s 5x lower, so changes are one-fifth as likely to fail.*
Shift from ITIL: DevOps and Modern IT
The traditional Information Technology Infrastructure Library (ITIL) model was developed in the late 1980s, a time when people were shipped physical disks for application updates. And while not every company then was in the business of selling software, almost every business now relies on running software and delivering online services. Software is disrupting every industry—entertainment, agriculture, finance…* This is where ITIL falls flat. ITIL separates duties and process approvals in an effort to support standardization and reduce duplication of work. This siloed and process-laden approach inherently slows down change. Nevertheless, many organizations still rely on this model, expecting to adhere to SLAs and maintain near-zero downtime despite incredibly rapid deployment demands.
ITIL won’t hold up in the always-on, 24/7 IT paradigm for teams that need to drive innovation, maintain uptime, and support employee growth. Accordingly, we advocate for a DevOps model as a cornerstone of incident response.
DevOps is an approach to work where teams continuously look for methods to evaluate and improve technology, process, and people as they relate to building, deploying, operating, and supporting the value our organization provides. It’s a broader shift in mindset that leads to addressing the needs of the business through the lens of the customer. We accomplish this through an increased focus on collaboration, measuring and improving processes, getting customer feedback, and improved transparency.
Bring DevOps Into Your Life
Benefits of DevOps + Collaborative Incident Response
Combining DevOps with a forward-thinking incident response tool means the end of a sh*t on-call experience.
For Ops: On-Call That Doesn’t Suck
- Collaborate with developers behind the code
- Ditch the shared pager: acknowledge and resolve from your own mobile device
- Integrate across your toolchain (monitoring & more) for centralized information
- Access the context you need, quickly: no vague 2 a.m. alerts
- Improve alert speed to deploy quickly without sacrificing safety or efficiency
For Devs: Owning Your Code
- Empower development teams
- Create more stable operating environments
- Spend time building and innovating — not fixing and maintaining
- Improve overall quality of your code
- Support ownership and accountability, regardless of role or title
For the Business: Increase Efficiency and Boost the Customer Experience
- Stay ahead of the competition
- Limit downtime & improve service quality
- Increase productivity — and happiness — of IT staff
- Drive quality communication across teams
- Increase overall organizational velocity
Modern On-Call Incident Life Cycle
Today’s teams must manage incidents across the entire lifecycle — folding in detection, response, remediation, analysis, and readiness. In this section, we’ll dive into the five different phases of the incident life cycle.
For each stage, we’ll cover the definition. Then, we’ll discuss how it relates to the features and functionality you need in on-call and incident response software to do more than react to alerts.
Stage 1: Detection
Detection is the observation of a metric, at certain intervals, and the comparison of that observation against an expected value. Monitoring systems then trigger notifications and alerts based on the observation of those metrics.
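As a rough illustration of that definition, here is a minimal Python sketch of threshold-based detection. The metric name, sample values, and the `detect` helper are all hypothetical, and real monitoring systems typically add smoothing, time windows, and recovery notifications on top of this basic comparison.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alert:
    metric: str
    observed: float
    threshold: float
    sample_index: int


def detect(metric: str, samples: Iterable[float], threshold: float) -> List[Alert]:
    """Compare each periodic observation against the expected ceiling.

    An alert is emitted only when the value crosses from healthy to
    unhealthy, so a sustained breach produces one alert rather than one
    alert per sample.
    """
    alerts: List[Alert] = []
    breaching = False
    for i, value in enumerate(samples):
        if value > threshold and not breaching:
            alerts.append(Alert(metric, value, threshold, i))
        breaching = value > threshold
    return alerts


# Hypothetical CPU utilization samples (percent), one per polling interval.
cpu_samples = [41.0, 55.5, 93.2, 97.8, 62.0, 95.1]
alerts = detect("cpu_percent", cpu_samples, threshold=90.0)
for a in alerts:
    print(f"ALERT {a.metric}={a.observed} (> {a.threshold}) at sample {a.sample_index}")
```

Alerting only on the transition into a breach is one simple way to cut down on repeated notifications for the same underlying problem.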
How It Relates to Incident Response Software
Simply put, detection is monitoring insights, looking for the signs and signals of an incident.
However, in organizations with legacy monitoring configurations, actually improving detection is tough. Environments are configured with broadly applied, arbitrarily set thresholds. The impact on on-call teams is measurable:
Too many false alerts + Too many interruptions = Acute Alert Fatigue
For the above reasons, high-performing teams focus on two things in addition to the basics. The first is time series analysis in their monitoring and detection systems. For example, some progressive, in-market solutions offer a time-series database, enabling wide adoption in both new projects and within existing environments. Your incident response tool should be able to seamlessly integrate with advanced monitoring tools to improve measurement fidelity.
The second is an accurate feed of what’s happening in your environment. In VictorOps, we call it the “Timeline.” A timeline provides continuous data from across your ecosystem as alerts flow through the system, providing a broad, holistic picture of the size, scope, and urgency of any given alert at any given moment in time.
Stage 2: Response
The response phase is the delivery of a notification to an incident responder via any means and the first steps the responder takes to address the alert. Thus, a detection threshold is passed, an email/SMS/chat/phone call is sent (notification), and someone acknowledges receipt (response).
How It Relates to Incident Response Software
There are a few key features to ensure the response happens effectively. You can think about these features as on-call essentials or, depending on how thin the feature set is, “basic alerting.” Thus, the leading incident response tools in market will offer:
- Dynamic scheduling
- Team-specific rotations
- Automated escalation(s)
- Scheduled overrides
These feature sets are essential, yet in isolation, they’re simply not robust enough to support a true DevOps culture. High-performing DevOps teams tend to focus on less reactive environments, investing in the people, process, and tooling to ensure teams are proactively preparing, minimizing, and preventing incidents. Accordingly, every second during response provides an opportunity for improved reliability and uptime.
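To make automated escalation concrete, the following Python sketch works out who gets paged before an incident is acknowledged. It is an illustration only: the `EscalationStep` type, the responder names, and the delays are assumptions, not the configuration model of any particular product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    notify: str          # hypothetical responder or rotation name
    after_minutes: int   # delay since the incident was triggered


def notified_responders(
    policy: List[EscalationStep], acked_at_minute: Optional[int]
) -> List[str]:
    """Return who gets paged, in order, before the incident is acknowledged.

    A step fires if its delay elapses strictly before the acknowledgement.
    With no acknowledgement at all, every step fires.
    """
    paged: List[str] = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if acked_at_minute is not None and step.after_minutes >= acked_at_minute:
            break
        paged.append(step.notify)
    return paged


# Hypothetical three-step policy.
policy = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("engineering-manager", 25),
]

print(notified_responders(policy, acked_at_minute=12))
print(notified_responders(policy, acked_at_minute=None))
```

In this sketch an acknowledgement at minute 12 stops the chain after the secondary responder, while an unacknowledged incident reaches every step in the policy.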
This is an important point: Developers will not positively respond to (read: adopt) a highly-reactive on-call management tool. The tool needs to offer context, collaboration, and visibility.
Many high-performing teams have found success through ChatOps tooling and workflows that centralize communication and set up the first responder for success. While receiving a basic notification in Slack/Stride/Mattermost is great, a contextual alert with a visual indication of the current state, plus links to relevant runbooks or dashboards, saves the responder valuable time digging into the error.
When purchasing an incident response tool, buyers should look not only for bidirectional chat integrations and ChatOps functionality but also the ability to configure alerts to fit team needs—any information present in the alert payload can be used to provide additional details to the on-call responder. Straightforward contextual details attached to each alert will reduce the stress of on-call and provide a next-level technique for resolving incidents faster.
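As a small sketch of what using the alert payload for context might look like, the Python below builds a responder-facing message from payload fields. The field names (`service`, `severity`, `summary`, `dashboard_url`), the `RUNBOOKS` lookup, and the example URLs are assumptions made for illustration and do not reflect a specific tool's payload format.

```python
from typing import Dict

# Hypothetical runbook lookup keyed by the alert's "service" field.
RUNBOOKS: Dict[str, str] = {
    "checkout-api": "https://wiki.example.com/runbooks/checkout-api",
}


def build_notification(payload: Dict[str, str]) -> str:
    """Turn a raw alert payload into a message a responder can act on."""
    service = payload.get("service", "unknown-service")
    severity = payload.get("severity", "unknown").upper()
    summary = payload.get("summary", "no summary provided")
    lines = [f"[{severity}] {service}: {summary}"]
    # Attach extra context only when the payload or lookup provides it.
    if "dashboard_url" in payload:
        lines.append(f"Dashboard: {payload['dashboard_url']}")
    if service in RUNBOOKS:
        lines.append(f"Runbook: {RUNBOOKS[service]}")
    return "\n".join(lines)


# Hypothetical payload from a monitoring tool.
message = build_notification({
    "service": "checkout-api",
    "severity": "critical",
    "summary": "p99 latency above 2s",
    "dashboard_url": "https://grafana.example.com/d/checkout",
})
print(message)
```

The same message could be posted to a chat channel or sent as a push notification; the point is that the responder sees the state and the next step in one place.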
Stage 3: Remediation
Remediation is the true “firefighting” stage of incident response, where teams aim to quickly diagnose and solve the problem.
How It Relates to Incident Response Software
A variety of factors impact the length of the remediation stage, often a combination of severity and unknowns. Of these, the severity of the incident often correlates most directly with MTTR. This “severity” factor may leave teams feeling like the overall time to repair is outside their control; however, there are a variety of ways the combination of incident response software, processes, and team can put the control back in their hands.
The first piece depends on contextual alerts: what data does the team have access to and, perhaps more importantly, do they have the ability to understand the real-life implications of the data? Contextualization of data allows teams to turn metrics into actionable insights that provide a higher fidelity picture of the incident.
Incident response software can act as a black box for time-series systems (e.g., InfluxDB), log analytics systems (e.g., Splunk), and changes to production (e.g., Jenkins, GitHub).
Regardless of your specific approach to these metrics, your incident response software ought to support a holistic picture of your systems and data. Robust integrations, contextual alerts, and runbooks attached to alerts serve as a collective knowledge base for dealing with a variety of issues, no matter your role or tenure.
Stage 4: Analysis
The analysis phase, often referred to as postmortem or post-incident review, is the learning process after an incident is resolved. While the historic approach to this phase has relied heavily on Root Cause Analysis (RCA), increasingly complex systems have led progressive teams away from relying only on single causal entity analysis. Instead, teams are increasingly looking towards models that address system complexities, e.g. Cynefin, to better understand the holistic, multi-faceted cause of an incident.
How It Relates to Incident Response Software
When we discuss analysis, there are a few key pieces necessary for incident response software to support a healthy Post-Incident Review (PIR). The first is the Incident Dashboard or Timeline, which is helpful for providing a quick view of misbehaving systems before and during the incident; who shipped something to production; who was taking action; what actions that individual was taking; and what communication was happening throughout the incident. All of these pieces serve as critical data for an effective PIR.
Close readers may notice some nuances to words we’ve chosen (or avoided) as we discuss incident analysis, namely “Post-Incident Review” and “root-cause analysis” (RCA).
Post-Incident Review is our replacement for post-mortems. You can learn more about our approach to the Post-Incident Review, including why it’s so essential for DevOps teams—here. The decision to not use RCA mirrors this sentiment based on the current complexity of people and systems.
The second is also reporting related: mean time to acknowledge (MTTA) and mean time to resolve (MTTR). MTTA/MTTR reporting allows your teams to visualize and uncover the underlying trends in a team’s ability to respond to and resolve incidents. By holistically analyzing the impact of incident volume, along with your team’s use of the incident response software, you can identify levers to lower MTTA/MTTR specifically and minimize the cost of downtime.
The third is a Post-Incident Review report. Unlike the internal PIR process itself, this PIR is a tangible report where individuals, including leadership, can quickly pull a timeframe of data (no more manual aggregation of emails, Slack, SMS, and monitoring systems) for key learnings. This report facilitates a PIR, or “retrospective”, and documents long-term action items. Out-of-the-box PIR reporting allows your team to quickly and easily access monitoring data, system actions, and human remediation to better understand the who, what, when, where, and why of an incident. All of this analysis is essential for the preparedness and readiness required for teams to not only quickly resolve incidents in production, but also improve the reliability of systems to proactively address issues before they occur.
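The MTTA/MTTR reporting described above boils down to simple arithmetic over incident timestamps. The Python sketch below assumes both metrics are measured from the moment the incident is triggered; the `Incident` type and the sample timestamps are hypothetical, and some teams measure time to resolve from acknowledgement instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to acknowledgement."""
    seconds = [(i.acknowledged - i.triggered).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to resolution (an assumption of this sketch)."""
    seconds = [(i.resolved - i.triggered).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Hypothetical incident history.
t0 = datetime(2024, 1, 1, 2, 0)
t1 = datetime(2024, 1, 8, 14, 30)
history = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
    Incident(t1, t1 + timedelta(minutes=6), t1 + timedelta(minutes=50)),
]

print("MTTA:", mtta(history))
print("MTTR:", mttr(history))
```

Tracking these values per team or per service over time is what turns them from a single number into a trend you can act on.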
Stage 5: Readiness
Readiness, the next logical step, is the phase where teams take action to enact improvements to people, process, and technology in order to prepare for and, as much as possible, prevent future incidents. Actions taken during this phase range from architecture and application changes to creating and updating runbooks to running Game Days.
How It Relates to Incident Response Software
Readiness is the full package of incident response software. As you review the various facets of your team, from systems to processes, does your software enable your team to proactively, collaboratively, and seamlessly address incidents to lower MTTA/MTTR—and minimize the cost of downtime?
In practice, this stage can be the most difficult. Despite a team’s best efforts, action items are often left unanswered and day-to-day work supersedes suggestions and improvements. While response often expects full prevention of problems, high-priority projects somehow take the place of supporting these fragile systems.
Of course, one of the best ways to be prepared is to integrate readiness into the software delivery lifecycle (SDLC). Creating a culture where ownership doesn’t end when something is shipped into production is an essential piece of minimizing downtime. After all, what’s the point of DevOps if the dev team gets to ship something into production at 5pm on a Friday only to leave an Ops team firefighting all weekend long? While the two aren’t always completely causal (let’s avoid RCA), software releases are the single biggest factor contributing to downtime.*
Teams must find a way to incorporate reliability into releasing, and while you need the right people and processes in place, tooling can help. Look for an incident response solution that provides visibility into the SDLC via developer tooling integrations (e.g., GitHub, Jenkins). With this visibility, developers and ops alike have a holistic view of what’s happening across systems, including shipments to production.
Additionally, you should take time to optimize your alert structure, configuring alerts to meet a team’s and organization’s needs. A noisy alert system or “paging” system can leave teams fatigued and unaware of which alerts actually require action. At VictorOps, our Transmogrifier is our unique alert rules engine, empowering teams to set up a few processes essential to readiness in the face of the most important alerts. Here are a few key configurations:
- Alert Rules: Match behavior to fields in alert payloads and create cascading logic to meet often demanding automation needs.
- Noise Suppression: Using suppression and classification (either critical, warning, or info), unactionable alerts will be visible in Timeline and Reporting but won’t disturb users. Alert aggregation further reduces noise by bucketing related alerts into a single incident, adding even more intelligence to your input stream.
- Alert Annotations: Link alerts to relevant and helpful instructions, images, graphics, data, notes, or wiki-based runbooks to help responders have everything they need to quickly investigate and resolve the incident.
- Routing: Set up unique escalation policies in line with team needs and fine-tune. Kick off escalation
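To show how rules, suppression, annotation, and aggregation of this kind can fit together, here is a generic Python sketch. It is not the Transmogrifier's actual syntax or behavior: the rule format, field names, team name, and URL are all hypothetical and chosen only to illustrate the ideas.

```python
from collections import defaultdict
from typing import Dict, List

Alert = Dict[str, str]

# Hypothetical rules: when every key/value in "match" equals the alert's
# fields, the "set" values are merged into the alert. Rules run in order,
# so later rules can build on fields set by earlier ones (cascading).
RULES = [
    {"match": {"env": "staging"}, "set": {"severity": "info"}},
    {
        "match": {"service": "db", "severity": "critical"},
        "set": {
            "runbook": "https://wiki.example.com/runbooks/db",
            "route": "database-team",
        },
    },
]


def apply_rules(alert: Alert) -> Alert:
    """Return a copy of the alert with all matching rules applied in order."""
    enriched = dict(alert)
    for rule in RULES:
        if all(enriched.get(k) == v for k, v in rule["match"].items()):
            enriched.update(rule["set"])
    return enriched


def should_page(alert: Alert) -> bool:
    """Only critical alerts interrupt a human; the rest stay visible but silent."""
    return alert.get("severity") == "critical"


def aggregate(alerts: List[Alert]) -> Dict[str, List[Alert]]:
    """Bucket related alerts into one incident per service/check pair."""
    incidents: Dict[str, List[Alert]] = defaultdict(list)
    for a in alerts:
        incidents[f"{a.get('service')}/{a.get('check')}"].append(a)
    return dict(incidents)


# Hypothetical incoming alerts.
raw = [
    {"service": "db", "check": "replication_lag", "severity": "critical", "env": "prod"},
    {"service": "db", "check": "replication_lag", "severity": "critical", "env": "prod"},
    {"service": "web", "check": "disk", "severity": "critical", "env": "staging"},
]
processed = [apply_rules(a) for a in raw]
incidents = aggregate(processed)

for key, grouped in incidents.items():
    page = any(should_page(a) for a in grouped)
    print(f"{key}: {len(grouped)} alert(s), page={page}")
```

In this sketch the staging alert is downgraded to info and never pages anyone, while the two database alerts collapse into a single incident that carries a runbook link and a route.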
Incident Response Maturity
Beyond the stages of an incident, from readiness to resolution, there is a continuum of maturity for organizations and their overall approach to incident response.
Reducing Mean Time to Resolution (MTTR) requires strong collaboration and feedback loops between delivery and operations teams.
This culture of learning is fundamental to modern incident response and excellent DevOps practices.
Questions to Ask Before Purchasing a Solution
Here’s the thing: The majority of incident response tools on the market address the basics of “alerting.” These basic feature sets, i.e., enriched alerting, on-call scheduling, broad integrations, and varied notification methods, are all standard features.
During the evaluation process, buyers should think about the next level of feature sets aside from basic functionality. Essentially, you want to invest in a platform that continues to advance beyond alerting, building features that support a culture of high availability (reduced alert noise, improved uptime and SLAs, a culture of near-zero downtime) as well as DevOps standardization.
Perhaps most importantly, you want to look for software that treats you like a human being, i.e., being on-call shouldn’t crush your soul. In today’s connected workplace, most people don’t work 9-5 anymore. For employees who have to answer an on-call page in the middle of the night, you’d like to know the software (and the people behind the software) have your back. Take a look at the support team for the incident response solution you’re evaluating and determine if they have a progressive, user-first mindset. Do they build features for the user or the CEO? Do they care about your experience waking up to an outage at 2:00 AM?
Your software needs to do more than check boxes; it should make your on-call life not suck while simultaneously growing and scaling along with the organization.
These are the most important questions to ask of your solution:
Questions for on-call management
- Will I find contextual alerts with abundant information for resolution?
- Does the tool have built-in automation to reduce noise and alert responders only during critical incidents?
- Does the tool support collaboration with bidirectional group chat integrations?
- Does the software support international notifications?
- Does this tool support/integrate with my existing critical toolchain components?
- Can I access a variety of reports, including MTTA/MTTR and overall incident frequency?
- Is there a native mobile app that supports on-the-go on-call?
- How easy is it to conduct a thorough post-incident review?
- How hard is it to access historical data?
- How can I configure alerts?
- Are there varied levels of user permissions?
- Do I have SDLC visibility to see when things are shipped to production?
Questions for DevOps teams
- How likely is it that my development team would use this tool? Would they find value in alerting? Or, would they simply be inundated with noisy alerts that make on-call miserable?
- Does this tool prepare me for continuous learning and continuous improvement?
- Can I access out-of-the-box performance metrics to report on SLAs and uptime?
- How easy is it to conduct a thorough post-incident review?
- Does this tool surface when new code is pushed into production?
- Is this tool built for DevOps standardization? Or would we need to migrate to a new tool as our team progresses?
VictorOps is Collaborative Incident Response. Unlike our competitors, our system leans into the progressive vision of DevOps — providing broad visibility, from deployments to production, to even the noisiest systems.
We centralize user activity for next-level event transparency, so your team can lean into the speed of DevOps.
Ready to see VictorOps end-to-end incident response in action? Sign up for a personalized demo with one of our product experts or go at it yourself in a 14-day free trial.