When an incident occurs, the system can often detect and resolve it on its own. In other cases, engineers still need to get involved. When an application or system breaks and can’t recover automatically, your monitored metrics will deviate from normal and the system will know to alert the on-call team. The general process for being on-call can be broken down into four main components: alerting, triage, mitigation, and resolution.
Obviously, you need to know when there’s a problem with your system. First and foremost, you need to ensure that alert notifications are working and being sent out when an application or system fails.
Next, incidents need to be routed to the proper team in a timely manner so the incident can be acknowledged quickly and accurately. Your system should be set up in such a way as to notify the correct on-call engineer(s) as soon as it detects an anomaly.
However, your application or system may not always route the incident correctly and could notify a person or team that can’t handle that specific request. That’s why your incident management system needs to include re-routing functionality. Even if the wrong party is initially notified, they can easily transfer the incident to the proper team(s).
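The routing and re-routing described above can be sketched in a few lines. This is a minimal illustration, not how any particular incident management product works: the routing table, team names, and the `route` and `reroute` helpers are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: maps a service name to the team that owns it.
ROUTES = {"checkout-api": "payments", "login-service": "identity"}
DEFAULT_TEAM = "platform"  # assumed catch-all team for unknown services


@dataclass
class Incident:
    service: str
    summary: str
    assigned_team: str = ""
    # Every team that has held the incident, in order.
    history: list = field(default_factory=list)


def route(incident: Incident) -> Incident:
    """Assign the incident to the owning team, or to the default team."""
    incident.assigned_team = ROUTES.get(incident.service, DEFAULT_TEAM)
    incident.history.append(incident.assigned_team)
    return incident


def reroute(incident: Incident, new_team: str) -> Incident:
    """Hand the incident to another team while keeping the trail of owners."""
    incident.assigned_team = new_team
    incident.history.append(new_team)
    return incident
```

Keeping the ownership history on the incident means the team that finally picks it up can see who was notified first and why it was transferred.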
Alerts that come with real-time metrics, logs, and other incident report information are the most beneficial. Implementing incident notification tools that create detailed alerts will provide crucial visibility into an incident. In-depth reports and a responsive notification system will allow on-call teams to triage and diagnose problems more effectively. Visual representations of an incident, such as charts and graphs, are helpful, as are detailed system notes in the form of logs, runbooks, and annotations.
Triaging the incident must be done quickly. The on-call engineer(s) involved in triaging will not be responsible for solving the issue or even pinpointing the exact problem. However, they do need to isolate the incident and understand, generally, where the solution might be found. While triaging, the engineer(s) also need to determine what other systems might be impacted, how severe the issue is, and which team(s) need to be looped in to fix the problem.
If the on-call engineer can fix the issue, that’s great, but it’s not their primary responsibility. Providing transparent, detailed incident information up front makes triaging an incident much simpler. At the end of the day, the first-alerted engineer is simply responsible for understanding what’s happening at a high level and getting the right people involved to solve the problem.
A solid set of communication tools is therefore imperative for the effective triage and coordination of incidents. Adopting some form of ChatOps will allow your on-call teams to communicate more effectively and, ultimately, reduce your MTTA (mean time to acknowledge) and MTTR (mean time to resolve).
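To make MTTA and MTTR concrete, here is a small sketch of how they could be computed from incident timestamps. The record layout and sample data are invented for illustration, and it assumes both metrics are measured from the moment the alert triggered; teams define the starting point differently.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average number of minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


# Illustrative incident records (made-up timestamps).
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 4),
        "resolved": datetime(2024, 1, 1, 10, 50),
    },
    {
        "triggered": datetime(2024, 1, 2, 22, 0),
        "acknowledged": datetime(2024, 1, 2, 22, 10),
        "resolved": datetime(2024, 1, 2, 23, 30),
    },
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")  # (4 + 10) / 2 = 7.0
mttr = mean_minutes(incidents, "triggered", "resolved")      # (50 + 90) / 2 = 70.0
```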
Once the incident has been triaged, the proper engineers will be working to alleviate the impact that the incident has on the system. The on-call mitigation engineer(s) will not necessarily fix the problem, but they will make sure the system is functioning smoothly enough that customers and external users are not affected any further. At this stage, your system as a whole will be functioning, and the incident should no longer impact any other integrated system functions.
Most DevOps and IT teams will tell you that mitigation starts with stabilizing the system as quickly as possible. At this point, engineers will not focus on finding the root cause of an issue, because it’s more important to bring the system back up as soon as possible. For example, rolling back to a previous deployment may quickly restore service, and you can then assess what happened after the fact.
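As a rough sketch of the rollback idea, the function below picks the most recent release before the current one that was known to be healthy. The release list, the `healthy` flag, and the `rollback_target` helper are assumptions made for this example; a real deployment tool would supply its own history and health data.

```python
from typing import Optional

# Hypothetical deployment history, oldest first. "healthy" is an assumed flag
# recorded while each release was live.
RELEASES = [
    {"version": "v1.4.0", "healthy": True},
    {"version": "v1.4.1", "healthy": False},
    {"version": "v1.5.0", "healthy": True},
    {"version": "v1.6.0", "healthy": False},  # current, failing release
]


def rollback_target(releases: list, current: str) -> Optional[str]:
    """Return the newest healthy release deployed before `current`, if any."""
    versions = [r["version"] for r in releases]
    if current not in versions:
        return None
    position = versions.index(current)
    # Walk backwards from the release just before the current one.
    for release in reversed(releases[:position]):
        if release["healthy"]:
            return release["version"]
    return None
```

With the sample history, `rollback_target(RELEASES, "v1.6.0")` selects `v1.5.0`. Root cause analysis of `v1.6.0` can wait until the system is stable again.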
Every second of downtime adds up to lost opportunities, lost revenue, and potentially poor customer experiences. However, the ultimate goal of mitigating an incident is to find a resolution and, in the end, to improve your system.
Your system is now stable, albeit not completely fixed. With the immediate pressure off, your on-call engineers will be able to investigate the root cause of the incident and determine the best way to solve the problem. Engineers will meticulously comb through the data and make sure that all KPIs have returned to normal.
You will know that the issue has been resolved when the initial indicator of the incident has stabilized and the system is functional again. Every team determines the finality of an incident differently, but it boils down to making sure that you have a solution in place to prevent the incident from ever occurring again.
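One way to express "the initial indicator has stabilized" is to check that its most recent readings all sit close to a known baseline. The sketch below does that; the 10% tolerance, the five-sample window, and the `has_stabilized` name are arbitrary choices for illustration, not recommended values.

```python
def has_stabilized(samples, baseline, tolerance=0.10, window=5):
    """True when the last `window` samples are all within `tolerance`
    (as a fraction of the baseline) of the baseline value."""
    if len(samples) < window:
        return False  # not enough data to judge yet
    recent = samples[-window:]
    allowed = tolerance * abs(baseline)
    return all(abs(s - baseline) <= allowed for s in recent)


# Example: response time in ms recovering toward a 200 ms baseline.
latency_ms = [900, 400, 210, 205, 198, 202, 200]
recovered = has_stabilized(latency_ms, baseline=200)
```

A check like this only confirms that the symptom has gone away. Whether the incident is truly resolved still depends on having a fix in place for the underlying cause.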
Want to read more about what happens after resolution? Here’s a great example of a real VictorOps post-incident review and root cause analysis from our own CEO, Todd Vernon.
Being the on-call engineer(s) when an incident occurs is scary. Having the know-how to act on an incident quickly and an intelligent process in place for managing your incidents will help. Your on-call team will feel more prepared for the incident and will, therefore, be able to handle the issue more effectively. Implement tools that will help your on-call engineers to efficiently communicate and work collaboratively on an alert. Trust your processes and do the work to mitigate the incident’s effects on your system immediately. But when you’re done, do what you need to do to prevent the incident from ever happening again.