While we all wish for 100% uptime and no bugs or errors, this is simply not the reality. The rise of continuous integration and continuous deployment (CI/CD) has created greater potential for incidents and downtime. Effective on-call schedules and rotations, alongside alert automation and escalation, drive service reliability and rapid incident response.
In the current DevOps ecosystem, you can’t avoid downtime. So, you need to prepare yourself for incident response and find ways to proactively test your systems. Through a combination of on-call preparation and proactive SRE efforts, you’ll be able to maintain near-constant uptime for highly complex, integrated systems.
In this article, we’ll cover the role of on-call incident response in maintaining resilient applications and infrastructure. More specifically, we’ll look at how DevOps and IT teams are setting up on-call schedules and rotations to quickly identify, respond to, and remediate incidents.
All five steps of the incident lifecycle benefit from well-developed on-call schedules and rotations – detection, response, remediation, analysis and readiness. Let’s take a peek at
Reliable systems depend on resilient people. The responsibilities of being on-call can be overwhelming – especially if you’re working on a chaotic, less mature system. Moving along the incident management maturity model – from a reactive process to a holistic, proactive system – will mitigate alert fatigue and make on-call easier.
Intelligent, quick-witted, resilient people are essential to building robust applications and on-call processes that don’t suck. Cultivating positive customer experiences and maintaining reliable services also depend on a company’s ability to provide employees with the on-call tools they need to be successful. A single-pane-of-glass view for on-call scheduling, alert routing, escalation policies and collaboration improves transparency for incident management. On-call responders can quickly see what’s happening, escalate issues to the right person or team and communicate in real time to resolve the incident.
But everything starts with optimized on-call rotations. On-call schedules should be set to provide 24/7 coverage for incidents and downtime. Your on-call calendar should also be structured in a way that spreads accountability evenly across the team, so that no single person bears the brunt of on-call responsibility. In order to make on-call suck less for both internal and external stakeholders, let’s take a look at effective types of schedules and rotations.
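As a rough illustration of spreading on-call evenly, here is a minimal sketch of a weekly round-robin rotation. The function name `weekly_rotation`, the one-week shift length and the example names are all assumptions made for illustration; they are not taken from any particular scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(team, start, weeks):
    """Assign one primary responder per week, cycling through the team in order.

    `team` is a list of names, `start` is the date the first shift begins,
    and `weeks` is how many one-week shifts to generate (all hypothetical inputs).
    """
    if not team:
        raise ValueError("team must contain at least one person")
    people = cycle(team)
    return [(start + timedelta(weeks=i), next(people)) for i in range(weeks)]


if __name__ == "__main__":
    # Hypothetical three-person team, rotation starting on a Monday.
    for shift_start, person in weekly_rotation(["ana", "ben", "chen"], date(2024, 1, 1), 4):
        print(shift_start.isoformat(), person)
```

Because the list is cycled in order, nobody takes a second shift until everyone else has taken one. Real rotations usually also need swaps, holidays and secondary responders, which this sketch leaves out.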
A few techniques can be applied to improve the quality of life and operational efficiency of your on-call teams. Making on-call suck less is one part process and two parts culture. The way you develop teams and rotations is always about finding the balance between incident coverage and employee happiness.
One useful template for on-call calendars is the follow-the-sun method. Follow the sun refers to planning your on-call schedules around the locations, and therefore the time zones, of the people on-call, so that each site covers incidents during its own working hours. If you only have one location and few remote workers, a follow-the-sun rotation may not be right for you. Where it is possible, this type of multi-site rotation is ideal for ensuring 24/7 coverage and limiting the times employees need to respond to an incident at 4 AM. A centralized incident management solution helps teams maintain follow-the-sun rotations, escalate issues when necessary and easily communicate, regardless of geographical location.
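To make the follow-the-sun idea concrete, here is a small sketch that picks which site is on-call for a given moment. The three sites and their UTC coverage windows are made-up assumptions chosen to roughly match local daytime; a real team would substitute its own locations and hours.

```python
from datetime import datetime, timezone

# Hypothetical sites and the UTC hours each covers (start inclusive, end exclusive).
# The windows are assumptions chosen to roughly match local daytime at each site.
SITES = [
    ("Sydney", 22, 6),   # wraps past midnight UTC
    ("London", 6, 14),
    ("Denver", 14, 22),
]


def covers(start, end, hour):
    """Return True if `hour` falls in [start, end), allowing a wrap past midnight."""
    if start < end:
        return start <= hour < end
    return hour >= start or hour < end


def on_call_site(moment):
    """Return the site whose coverage window contains the timezone-aware `moment`."""
    hour = moment.astimezone(timezone.utc).hour
    for name, start, end in SITES:
        if covers(start, end, hour):
            return name
    raise ValueError(f"no site covers UTC hour {hour}")


if __name__ == "__main__":
    print(on_call_site(datetime.now(timezone.utc)))
```

With windows like these, an alert raised at 03:00 UTC goes to the Sydney team in their early afternoon instead of waking someone in London at 3 AM.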
Of course, your applications and infrastructure need constant coverage. But this on-call coverage shouldn’t fall on one person. Especially for small teams, it’s important that on-call responsibilities are distributed evenly across the organization, often across both developers and operations teams. As time passes, you’ll begin to hone your monitoring and alerting tools to help identify the greatest weaknesses in your service. With this knowledge, you can find out if a few people are getting stuck with the majority of incidents and adjust on-call rotations accordingly.
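One simple way to check whether a few people are absorbing most of the incidents is to count incidents per responder and compare each count with the team average. The sketch below assumes you can export a list of who handled each incident; the function name `overloaded_responders`, the 1.5x threshold and the sample data are illustrative assumptions, not values from any specific tool.

```python
from collections import Counter


def overloaded_responders(incident_log, roster, threshold=1.5):
    """Return roster members who handled more than `threshold` times the mean load.

    `incident_log` is a list with one responder name per incident, and `roster`
    is the full on-call team, including people who handled nothing.
    """
    if not roster:
        return []
    counts = Counter(incident_log)  # missing names count as 0
    mean = sum(counts[name] for name in roster) / len(roster)
    return sorted(name for name in roster if counts[name] > threshold * mean)


if __name__ == "__main__":
    # Hypothetical month of incidents for a four-person team.
    log = ["ana"] * 6 + ["ben"] * 2 + ["chen"]
    print(overloaded_responders(log, ["ana", "ben", "chen", "dee"]))
```

Counting against the full roster matters: someone who handled no incidents still lowers the average, which makes the imbalance easier to see. A raw count is only a starting point, since incident severity and time of day also affect how heavy a shift feels.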
Everyone across the entire engineering team is responsible for the building and maintenance of reliable systems. So, on-call should be approached the same way. Putting developers on-call creates deeper exposure to systems in production and accountability for writing good code. Spreading on-call responsibilities across the organization, emphasizing a culture of collaboration and providing the right tools will drive accountability for service reliability across the entire organization.
Accountability, code ownership and developer exposure to systems in production all contribute to reliability. IT operations and developers tighten the feedback loop and collaborate closely throughout the entire software delivery lifecycle to maintain maximum uptime and keep customers happy. A deep understanding of how systems are working in production allows developers to build future services and features faster, and more reliably.
DevOps ushers in an era where developers can’t throw code over a wall and let operations deal with all of the maintenance and upkeep. That throw-it-over-the-wall process results in a lack of visibility and bad code being shipped by developers. When developers and IT teams share on-call responsibilities, it mitigates alert fatigue, reduces cross-team animosity and allows you to build reliable services faster.
VictorOps is purpose-built for DevOps. By creating an integrated IT alerting and on-call experience, VictorOps brings IT operations and developers closer together. Sign up for your 14-day free trial and see why centralizing incident response, monitoring and alerting, and communication makes on-call suck less.