One key rule in IT operations management is knowing when (and why) something breaks and how to resolve it. The best IT operations managers own the entire incident management lifecycle, from real-time incident detection to preparation for future incidents. IT operations managers act as the essential connection between technical systems and the people who use them.
But in the era of CI/CD and complex, cloud-based infrastructure, IT operations managers are constantly balancing speed and reliability. Today’s IT operations managers strive for highly flexible infrastructure and for operations involvement earlier in release management and deployment. With IT Ops input and testing throughout the software development lifecycle (SDLC), fewer incidents tend to reach production.
Exposure is one of the core principles of a truly DevOps-centric organization. IT managers and engineering managers shouldn’t be working in silos – they should be exposing each other’s teams to the pain points they feel. This way, developers write better code for production environments while sysadmins and other IT professionals can set up infrastructure and applications that reduce work in progress (WIP) and improve developer efficiency.
An IT operations manager’s number one responsibility is to avoid becoming a blocker to delivering features to customers. A prepared system for incident response and on-call alerting is the most dependable foundation for service reliability. IT operations should be laying a flexible groundwork for a system that encourages speed in software development and incident response, creating a culture of resilient CI/CD and a competitive team that doesn’t shy away from taking risks.
IT operations managers maintain the servers, networks and applications that keep businesses running and customers happy. But they also manage the people on the IT team, making sure employees don’t experience alert fatigue or burnout. And in IT operations, that can be quite difficult.
IT operations managers juggle numerous responsibilities across people, processes and technology. Effective monitoring and alerting practices can provide full coverage of a system without over-alerting, helping teams avoid blind spots and build more resilient applications and infrastructure without overburdening employees.
In this post, we’re focusing on alerting specifically. Use it as a guide if you’re an IT operations manager looking to mitigate alert fatigue, ensure rapid software delivery, encourage faster incident response and manage more reliable architecture, all while keeping employees and customers happy.
IT operations managers should have a thorough understanding of the incident lifecycle. Breaking down the incident lifecycle into five parts can help DevOps and IT teams find ways to improve alerting and make real-time collaboration more actionable at each stage. So, let’s take a deeper look at each of the five stages and how IT operations managers are making on-call alerting suck less.
Most of incident detection relies on effective monitoring. But, if you aren’t being properly alerted when a monitoring threshold is surpassed or a metric crosses into “unhealthy” territory, have you really detected it? Alerting processes can leverage an intelligent rules engine and integrated on-call schedules to ensure the right person is alerted every time. You can also silence unactionable alerts or reduce repeating notifications to ensure on-call responders are focused on the highest priority issues.
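To make the ideas above a little more concrete, here is a minimal sketch of a rules engine that silences unactionable alerts, suppresses repeat notifications and routes everything else to whoever is on call. Every name in it (`Alert`, `AlertRouter`, the `"ops-manager"` fallback, the ten-minute repeat window) is a hypothetical choice made for illustration, not the API of any particular alerting product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str  # assumed values: "info", "warning", "critical"
    fired_at: datetime


@dataclass
class AlertRouter:
    # Hypothetical on-call schedule: service name -> current responder.
    on_call: dict[str, str]
    # Severities treated as unactionable and never paged (an assumption).
    silenced_severities: frozenset[str] = frozenset({"info"})
    # Repeats of the same check inside this window are not re-sent.
    repeat_window: timedelta = timedelta(minutes=10)
    # Who gets paged when a service has no on-call entry (an assumption).
    fallback: str = "ops-manager"
    _last_paged: dict[tuple[str, str], datetime] = field(default_factory=dict)

    def route(self, alert: Alert) -> Optional[str]:
        """Return the responder to page, or None if the alert is dropped."""
        if alert.severity in self.silenced_severities:
            return None  # unactionable: silence it

        key = (alert.service, alert.check)
        last = self._last_paged.get(key)
        if last is not None and alert.fired_at - last < self.repeat_window:
            return None  # repeat of a recent page: suppress it

        self._last_paged[key] = alert.fired_at
        return self.on_call.get(alert.service, self.fallback)
```

With a router built as `AlertRouter(on_call={"checkout": "dana"})`, a critical checkout alert pages dana once, a repeat five minutes later is dropped, and an alert for a service with no schedule falls through to the fallback responder. A real rules engine would carry far more state, but the three decisions (silence, deduplicate, route) are the core of it.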
But once the on-call team is aware of the issue, how do they escalate it, communicate about the resolution and get the context they need? Speedy incident response thrives in highly collaborative DevOps environments where IT operations teams and software developers work together to triage, investigate and identify an incident’s source. IT operations managers should focus on processes that help teams mobilize quickly and surface useful context in real time alongside alerts (e.g. runbooks, charts, logs, traces and metrics).
Once the team knows exactly what’s wrong and how to fix it, it’s just a matter of time until they do. An IT operations manager needs to ensure that everyone in the organization, both developers and IT professionals, has the permissions and skills required to solve the issues that reach them. If on-call alerts reach a person who can’t access the tool or service they need for incident remediation, then no amount of alert context can help them fix the issue.
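One way to guard against that failure mode is to check access before paging anyone. The sketch below walks an escalation chain and picks the first responder who holds every permission the fix requires. The function name, the shape of the `access_grants` mapping and the example tool names are all assumptions made for illustration.

```python
from typing import Optional


def first_capable_responder(
    escalation_chain: list[str],
    required_access: set[str],
    access_grants: dict[str, set[str]],
) -> Optional[str]:
    """Return the first responder in the chain who can actually fix the issue.

    Returns None when nobody in the chain holds all of the required access,
    which is itself a signal that permissions need attention.
    """
    for responder in escalation_chain:
        granted = access_grants.get(responder, set())
        # Subset check: every required permission must be granted.
        if required_access <= granted:
            return responder
    return None
```

For example, with grants of `{"sam": {"grafana"}, "priya": {"grafana", "prod-db"}}`, an incident that needs `"prod-db"` skips sam and goes straight to priya instead of waking up someone who can only watch the dashboard.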
After every incident, there should be a thorough post-incident review that analyzes how people, processes and technology interacted to restore the application or service. What worked? What didn’t? What improvements to the overall incident detection and response system would have led to faster remediation? Without analysis, you can’t learn from failure and continuously improve the way DevOps and IT teams work together to maintain application and infrastructure availability.
Now that you know what can be improved, improve it. How can you be more prepared the next time an incident strikes? Preparation allows IT operations managers to live with more risk in the release management and deployment pipeline and encourage developers to ship applications and services faster. Some teams perform proactive chaos engineering exercises and tests to better understand how their system will hold up to stress. A prepared on-call team is the only surefire way for an IT operations manager to maintain resilient applications and infrastructure in a world of CI/CD.
Alerting and incident response go hand in hand. Once an alert comes in, the team needs to know exactly how to respond to the issue and how to get the alert to the right person. Software developers can no longer throw code over the proverbial wall and force IT operations managers and their teams to figure out how to reliably package, deploy and maintain it. Both developers and IT professionals should be taking on-call responsibility and accountability for the reliability of the services they build.
Holistic alerting refers to notifying the right person at the right time with the appropriate details they need to fix the issue. Collaborative incident response refers to the way teams interact during a firefight, how they share information and the types of DevOps tools at their disposal.
Combined, holistic alerting and collaborative incident response create a proactive system for addressing reliability concerns and fixing production incidents. You can’t have one without the other. Better alerting leads to more collaborative incident response, and more collaboration leads to better alerting. It’s a continuous cycle of refining the way you alert on-call responders to problems and how your people, processes and tools work together in real time to reduce mean time to acknowledge (MTTA) and mean time to resolve (MTTR), keeping employees and end users happier.
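If you want to track whether that cycle is actually paying off, both metrics are simple averages over your incident history. Here is a minimal sketch, assuming each incident records when it was triggered, acknowledged and resolved, and assuming MTTR is measured from trigger to resolution (teams define it in different ways). The `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one incident to compute a mean")
    return sum(durations, timedelta()) / len(durations)


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from an alert firing to a responder acknowledging it."""
    return mean_duration([i.acknowledged_at - i.triggered_at for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from an alert firing to the incident being resolved."""
    return mean_duration([i.resolved_at - i.triggered_at for i in incidents])
```

Comparing these numbers before and after a change to your alerting rules or escalation policies gives you a rough read on whether the change helped.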
First and foremost, IT operations managers should encourage communication and transparency between all teams. Anything that improves transparency into monitoring and alerting workflows will lead to more collaboration, and more collaborative teams will better understand how to communicate and work together to fix problems when they arise. IT operations managers should gain more exposure to development workflows, and developers should have more exposure to release management, deployment and maintenance operations. Try instituting some sort of “Study Abroad” program where developers and IT professionals can learn more about each other’s roles and see how they could work together better.
DevOps isn’t just a buzzword. There are real, actionable ways that IT operations managers can implement a culture of DevOps-centric collaboration and transparency – leading to happier employees and end-users. Check out our free eBook, Why DevOps Matters, to learn exactly how you can do this on your own team.