Finding the most effective way to manage incidents in your organization depends on two things: the maturity of your product and the maturity of your incident management processes. Your ability to manage the entire incident life cycle depends on how reactive or proactive you’re able to be. DevOps fits into your organizational culture and incident response to improve overall incident management.
Many reactive teams are only able to manage the first three steps of the incident life cycle: detection, response, and remediation. Bringing DevOps practices into your incident management can help you become more proactive. When the DevOps team collaborates end to end, with improved visibility and system exposure, incidents are more easily remediated. Faster incident resolution allows developers to spend more time developing new features and building reliability into the system, not simply responding to issues.
A collaborative team can communicate about the best ways to detect pain points in the system. Once your team collaboratively identifies potential weaknesses or blind spots, they can implement monitoring tools and corresponding thresholds. Then, set and adjust alerting systems to notify on-call team members appropriately when the system is experiencing ETL lag, a spike in CPU or disk usage, or a similar problem.
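As a rough illustration of what "monitoring tools and corresponding thresholds" can look like, here is a minimal Python sketch. The metric names, the threshold values, and the `evaluate` helper are all hypothetical, chosen only to mirror the ETL lag and CPU/disk examples above. In practice, your monitoring tool would hold this configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # hypothetical metric name
    warning: float   # value at which a low-urgency alert fires
    critical: float  # value at which the on-call engineer is paged


# Illustrative numbers only; real thresholds come out of your team's review.
THRESHOLDS = [
    Threshold("etl_lag_seconds", warning=300, critical=900),
    Threshold("cpu_percent", warning=80, critical=95),
    Threshold("disk_percent", warning=85, critical=95),
]


def evaluate(samples: dict[str, float]) -> list[tuple[str, str, float]]:
    """Return (metric, severity, value) for each sample that crosses a threshold."""
    alerts = []
    for t in THRESHOLDS:
        value = samples.get(t.metric)
        if value is None:
            continue  # no reading for this metric in this sample
        if value >= t.critical:
            alerts.append((t.metric, "critical", value))
        elif value >= t.warning:
            alerts.append((t.metric, "warning", value))
    return alerts


alerts = evaluate({"etl_lag_seconds": 1200, "cpu_percent": 82, "disk_percent": 40})
```

Keeping thresholds in one reviewable structure makes it easier for the whole team to adjust them after each incident.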
The key to detection is striking the balance between under-alerting and over-alerting. You want to avoid alert fatigue, but you don’t want to miss actionable incidents that may be hidden from your monitoring tools. Constant collaboration and continuous improvement by the DevOps team will allow you to keep iterating and optimizing your monitoring and alerting setup.
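One common way to reduce over-alerting is to suppress repeat notifications for the same alert within a cooldown window. The sketch below assumes a hypothetical `AlertSuppressor` class and a ten-minute cooldown. It is one possible approach, not a description of how any particular alerting product works.

```python
class AlertSuppressor:
    """Drop repeat notifications for the same alert key within a cooldown window."""

    def __init__(self, cooldown_seconds: float):
        self.cooldown = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_notify(self, key: str, now: float) -> bool:
        # `now` is a timestamp in seconds, passed in so the logic is easy to test.
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # same alert fired recently; stay quiet
        self._last_sent[key] = now
        return True


suppressor = AlertSuppressor(cooldown_seconds=600)
first = suppressor.should_notify("cpu_percent", now=0)     # first occurrence
repeat = suppressor.should_notify("cpu_percent", now=300)  # inside the window
later = suppressor.should_notify("cpu_percent", now=600)   # window has elapsed
```

The cooldown itself is a tuning knob: too short and the team is flooded, too long and a recurring problem goes unnoticed.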
Normally, a single on-call person will acknowledge and initially respond to an incident. But the on-call engineer may not always be the person with the knowledge to start fixing the problem. So, a DevOps-based incident management solution focuses on the ability to route, escalate, and collaborate around issues. The on-call engineer can quickly pull in the teammates who need to be involved and prioritize the incident response.
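Routing and escalation rules can be expressed as simple data. The following sketch is only an assumption about how such a policy might be modeled: the team names, delays, and the `route` and `who_to_page` helpers are invented for illustration.

```python
# Hypothetical routing table: which team owns alerts from which service.
ROUTES = {
    "database": "data-team",
    "checkout-api": "payments-team",
}
DEFAULT_TEAM = "platform-team"

# Hypothetical escalation policy: (minutes unacknowledged, who gets paged).
ESCALATION_POLICY = [
    (0, "primary-on-call"),
    (10, "secondary-on-call"),
    (30, "engineering-manager"),
]


def route(service: str) -> str:
    """Pick the owning team for a service, falling back to a default."""
    return ROUTES.get(service, DEFAULT_TEAM)


def who_to_page(minutes_unacknowledged: float) -> list[str]:
    """Return everyone who should have been paged by this point."""
    return [
        target
        for delay, target in ESCALATION_POLICY
        if minutes_unacknowledged >= delay
    ]


team = route("checkout-api")
paged = who_to_page(12)
```

With a policy like this, an alert that sits unacknowledged for 12 minutes has reached both the primary and secondary on-call engineers.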
The next level of incident response will add context to an alert when the on-call engineer receives it. That way, they can quickly and easily diagnose the problem, assess who needs to be involved, and escalate the issue appropriately. Being notified of a system alert with no context is only slightly better than not receiving an alert at all.
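To make the idea of a contextual alert concrete, here is a small sketch that attaches a runbook link, a dashboard link, and an owning team to an alert before it reaches the on-call engineer. The `Alert` type, the `SERVICE_CONTEXT` lookup, and the example.com URLs are all placeholders, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    summary: str
    context: dict[str, str] = field(default_factory=dict)


# Hypothetical per-service context; the URLs are placeholders.
SERVICE_CONTEXT = {
    "checkout-api": {
        "runbook": "https://wiki.example.com/runbooks/checkout-api",
        "dashboard": "https://dashboards.example.com/checkout-api",
        "owner": "payments-team",
    },
}


def enrich(alert: Alert) -> Alert:
    """Attach whatever context is known for the alert's service."""
    alert.context.update(SERVICE_CONTEXT.get(alert.service, {}))
    return alert


enriched = enrich(Alert("checkout-api", "p99 latency above 2s"))
bare = enrich(Alert("legacy-batch", "job failed"))
```

An alert for a service with no registered context arrives bare, which is itself a useful signal that the service needs a runbook.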
A DevOps team, with the proper tools and processes, can acknowledge an issue, re-route the problem based on contextual alerts, loop in the necessary people, and collaborate around the problem all in one place.
Now, the incidents actually need to get fixed. DevOps teams already have a leg up because they’ve had more exposure to code in production and should have improved awareness as to what the issue is. And, contextual alerts will help your teammates identify an issue more quickly. But, you also need to provide the team with the tools to quickly review logs, events, traces, and other applicable metrics.
Time-series databases, log analytics, and visualization tools can work together to give you deeper visibility into what might be going on. The collaborative DevOps team, because they’ve been maintaining and developing together, may also have more anecdotal knowledge around what might have happened.
Your incident management process should provide real-time collaboration options via SMS, phone, native chat, external chat applications, and a mobile app. This way, the full DevOps team can collaborate around an issue in real time and share data within the same centralized timeline.
Writing up a detailed post-incident review is a great place to start. Everyone on the DevOps team can chime in about what worked and what seemed difficult throughout the process. This opens up the floor to collaboration from multiple people from different areas of the organization (mobile, infrastructure, web client, middle-tier, data, etc.) and can bring weaknesses to light.
You can start measuring your preparedness from incident to incident. Ideally, your mean time to acknowledge and mean time to resolve an incident will continue to decline as you learn more about your system, tools, process, and people. And if they don’t, you should be able to identify areas for improvement. Maybe you can optimize some alerts coming out of New Relic, or maybe you simply need to set up a monitor for a segment of your system that was previously unmonitored. The more you can get out of your post-incident review analysis, the better your future incident response and resolution will be.
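Mean time to acknowledge (MTTA) and mean time to resolve (MTTR) are simple averages over your incident history. The sketch below uses two made-up incidents and assumes each record carries `triggered`, `acknowledged`, and `resolved` timestamps. Your incident tool may name or export these fields differently.

```python
from datetime import datetime
from statistics import mean

# Made-up sample incidents, for illustration only.
incidents = [
    {
        "triggered": datetime(2024, 1, 5, 10, 0),
        "acknowledged": datetime(2024, 1, 5, 10, 6),
        "resolved": datetime(2024, 1, 5, 11, 0),
    },
    {
        "triggered": datetime(2024, 2, 9, 22, 30),
        "acknowledged": datetime(2024, 2, 9, 22, 34),
        "resolved": datetime(2024, 2, 9, 23, 10),
    },
]


def mean_minutes(records: list[dict], start_key: str, end_key: str) -> float:
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")  # (6 + 4) / 2 = 5.0
mttr = mean_minutes(incidents, "triggered", "resolved")      # (60 + 40) / 2 = 50.0
```

Tracking these two numbers across post-incident reviews shows whether your changes are actually shortening the life cycle.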
By this stage in the incident life cycle, your whole team should be better prepared for an incident. You can set up runbooks or provide short triage instructions for a similar incident in case it happens again. The on-call engineer(s) and team members who were called in now have more exposure to the system or issue.
A DevOps culture improves cross-functional collaboration across steps one through four, leading to more confidence and readiness when future incidents occur. On the flip side, improved readiness makes everything in steps one through four easier. DevOps teams will continuously work to shorten the incident life cycle by improving incident readiness each time.
DevOps gives team members early exposure to systems and makes it easier for them to fix problems when they arise. A collaborative DevOps team ensures a higher level of visibility and communication throughout the entire incident life cycle, from incident detection to incident analysis.
VictorOps incident management is purpose-built for DevOps teams looking to improve collaboration and reduce the time to acknowledge and resolve incidents. Sign up for a 14-day free trial to see for yourself how DevOps teams are centralizing incident management, improving visibility, and making on-call suck less.