Modern software development and IT operations depend on transparency, speed and collaboration more than any technical tool. While writing new code, building new features and deploying quickly to production become the main focus of most DevOps-minded teams, there’s a whole other side of the equation. How do you identify production incidents faster, determine the cause of the problem and mobilize the right people to fix the issue? That’s where real-time incident response and on-call collaboration come into play.
Traditional IT service management (ITSM) adhering to the IT Infrastructure Library (ITIL) principles typically followed the practice of setting up infrastructure monitoring tools like Splunk and spinning up tickets in a service desk ticketing tool like ServiceNow. As a result, many teams today are missing a tool that bridges the gap between monitoring and ticketing in real time and lets them take action on their data.
Cloud-based incident response tools tie together everything from monitoring and alerting to documentation and communication. On-premises solutions are siloed from other tools and people – only creating value for the one team using them. While many ticketing tools are on-premises for security purposes, this can cause a lack of transparency for other affected stakeholders outside of IT. And, most teams are building highly complex hybrid cloud applications and infrastructure now – requiring a combination of cloud monitoring and on-prem infrastructure monitoring tools.
So, we wanted to lay out why teams are leveraging incident response tools in the cloud and some of the benefits and concerns associated with cloud-based incident response solutions.
First off, there are no on-premises solutions for real-time incident response and on-call scheduling today. But, there are on-premises ITSM ticketing tools and monitoring solutions that need to play nicely with cloud-based incident response platforms. The old way of managing tickets and fixing issues simply doesn’t work with the velocity of today’s programmers and IT teams. CI/CD and continuous testing across numerous systems and services are forcing teams to adopt DevOps principles and become more proactive about incident management.
Whether you choose an on-premises solution or a cloud-based tool will depend on your organization and the systems you’re responsible for. But, in general, you’ll find that a cloud offering is often more efficient and effective than on-premises incident management. In the next section, we’ll go over some of the benefits of cloud-based incident response and some areas of concern.
The pros and cons of cloud-based incident response often go hand-in-hand with the way your team builds and deploys services to production, as well as the type of services you maintain. If you’re operating in a closed network and work with highly sensitive customer data, it may be better to leverage an on-premises solution. However, you’ll find, even when using a cloud-based incident response tool, you can control the flow of information in and out of your service desk or monitoring tool and ensure only applicable, compliant data flows into your cloud solution.
So, let’s talk a little bit about the benefits associated with incident response in the cloud and why teams are moving away from the traditional ITIL mindset of ticket management and monitoring.
From the initial implementation, cloud solutions are easier to access and configure. Also, they create more transparency across users and business units faster. And, while the cloud-based incident response tool is faster to set up, it also makes incident detection, triage and remediation faster. Updates are made to the cloud-based SaaS product more often and will offer you more value in real-time with smaller, frequent releases and bug fixes. It’s much easier to centralize information across distributed systems and teams into a single source of truth when managing incident response in the cloud – leading to lower mean time to acknowledge and resolve (MTTA/MTTR) metrics while making end-users happier.
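If you want to see what those MTTA/MTTR numbers actually measure, here is a minimal Python sketch. It assumes you can export a trigger, acknowledgement and resolution timestamp for each incident, and it measures both metrics from the moment the alert fired. The `Incident` class, the function names and the sample timestamps are made up for illustration and don't come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    # Hypothetical record shape; real tools expose their own fields.
    triggered_at: datetime     # when the alert fired
    acknowledged_at: datetime  # when an on-call responder acknowledged it
    resolved_at: datetime      # when the incident was marked resolved


def mtta(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to acknowledgement."""
    seconds = [(i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from trigger to resolution (one common definition of MTTR)."""
    seconds = [(i.resolved_at - i.triggered_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data: two incidents that fired at the same time.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=2), start + timedelta(minutes=30)),
    Incident(start, start + timedelta(minutes=4), start + timedelta(minutes=50)),
]

print(mtta(incidents))  # 0:03:00
print(mttr(incidents))  # 0:40:00
```

Tracking these two averages over time is one simple way to check whether a change in tooling or process is actually speeding up your response.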
The flexibility of cloud-based incident response tools helps you to easily change on-call schedules, adjust alert rules or update escalation policies. As your business grows and you add more and more users into the tool, the tool also expands and scales with you. The scalability and flexibility of a cloud-based incident response tool aren’t specific to just technical systems or personnel changes; they’re all-encompassing. As you ingest more alerts, add more users, create more complex escalation paths and integrate more tools, you should see no loss in performance, speed or capability.
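To make the idea of an escalation policy a little more concrete, here is a rough Python sketch. It assumes a policy is simply an ordered list of steps, each with a delay and a set of people to page. The `EscalationStep` class, the `targets_to_notify` function and the role names are hypothetical; they aren't the configuration format of VictorOps or any other product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int       # minutes an incident has gone unacknowledged
    notify: Tuple[str, ...]  # who gets paged at this step


# Hypothetical policy: changing it is just editing this list.
POLICY = [
    EscalationStep(0, ("primary-on-call",)),
    EscalationStep(15, ("secondary-on-call",)),
    EscalationStep(30, ("engineering-manager",)),
]


def targets_to_notify(policy: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been paged by this point."""
    notified: List[str] = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if minutes_unacknowledged >= step.after_minutes:
            notified.extend(step.notify)
    return notified


print(targets_to_notify(POLICY, 20))  # ['primary-on-call', 'secondary-on-call']
```

Because the policy is plain data, adjusting a delay or adding a new step doesn't require touching the logic that decides who gets paged.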
IT operations and DevOps teams are often, rightfully so, concerned with data privacy and security when working with cloud solutions. The best part of managing incident response with a cloud-based tool like VictorOps is that it’s easy to configure what data flows into and out of the tool. If you have sensitive data in your service desk, you can create rules to exclude that information from flowing into the incident response tool without stopping the flow of critical alert data to the proper on-call team. The speed and flexibility of these customizations allow developers and IT engineers alike to make changes quickly while ensuring compliance and facilitating a system for rapid incident response.
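As a rough illustration of that kind of rule, the Python sketch below strips sensitive fields from a service desk ticket before it is forwarded as an alert. It assumes tickets are flat dictionaries; the field names, the `SENSITIVE_FIELDS` set and the `scrub_ticket` function are invented for this example and aren't part of any real service desk or incident response API.

```python
from typing import Any, Dict

# Hypothetical deny-list of fields that must never leave the service desk.
SENSITIVE_FIELDS = {"customer_email", "account_number"}


def scrub_ticket(ticket: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the ticket without sensitive fields."""
    return {key: value for key, value in ticket.items() if key not in SENSITIVE_FIELDS}


# Made-up ticket for illustration.
ticket = {
    "id": "INC-1",
    "summary": "Checkout latency above threshold",
    "severity": "critical",
    "customer_email": "someone@example.com",
    "account_number": "0000",
}

alert = scrub_ticket(ticket)
print(sorted(alert))  # ['id', 'severity', 'summary']
```

The on-call team still gets the critical alert context (summary and severity), while the excluded fields never reach the cloud tool. A stricter variant would use an allow-list so that new, unreviewed fields are dropped by default.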
You don’t need to wait six months to a year to get key updates to product functionality. If there are minor bugs or functionality that interferes with your team’s optimal workflows, those fixes and updates can often be pushed out within days or weeks, not months. In cloud-based tools, frequent small changes are often made that eventually lead up to major product enhancements – but at least you’ll see incremental value along the way.
If you’re using an on-premises tool, there’s a good chance that it won’t play nicely with a number of other integrated tools – on-premises or otherwise. Highly integrated products simply can’t test out all of the possible use cases with connected systems and ensure they always work well. But, cloud-based incident response offerings can easily identify issues with integration partners and fix those problems in real-time. Then, users who rely on all of these different monitoring, communication and service desk tools (cloud-based and on-premises) can centralize this information in one place and work cross-functionally to fix problems faster.
Oftentimes, DevOps engineers and IT teams are concerned with the uptime of cloud offerings vs. on-premises solutions. But, connectivity and uptime of cloud software are generally more reliable than they’ve ever been. With more and more failover options and built-in redundancies, cloud-based tools are highly reliable – and you need them to be. If there’s an issue, it’s unlikely to cause a major impact to customers. And, on-call teams are better equipped with the information they need in order to remediate an outage when something does come up.
Because cloud services are, well, in the cloud, technical support agents are able to access your systems and fix problems with you more effectively. With on-premises solutions, it’s harder to troubleshoot and diagnose what a customer’s issue actually might be. So, if you detect bugs or simply need help configuring the incident response tool, support teams are easier to interact with and more effective at finding solutions.
If you’re unconcerned with incident response speed or scalability, only using on-premises ITSM ticketing services like ServiceNow might be right for you. This system will be secure and will provide comprehensive documentation. However, be warned that incident resolution time will be slower and the flow of work from monitoring tools to service desks and vice versa will likely be interrupted or impossible.
Concerns around uptime, configuration or security shouldn’t be blockers from looking into a cloud-based incident response tool though. You’re in charge of your applications and infrastructure and the flow of information through your system. If you’re at all worried about sending certain data into a cloud-based incident response tool, you’re not required to do it. But, serving actionable context in a collaborative way to the right on-call teams and users, when they need it, will be much simpler in cloud-based incident response software.
You’re not replacing ServiceNow, Jira or BMC Cherwell with real-time firefighting and on-call management tools like VictorOps; you’re supplementing them. Instead of focusing on the ticket or digging through monitoring tools to find the context you need, you get that context served directly to the right person at the right time. While multiple users communicate around monitoring data and find solutions to problems, tickets are automatically updated – helping you maintain documentation without all of the manual work.
Incident response in the cloud shouldn’t be something you fear; it should be something you embrace. Creating more visibility into service health and notifying the right people faster should be at the forefront of any efficient DevOps or IT team’s priorities. End-users expect constant reliability and performance. And, the most dependable way to deliver this is through proactive incident management practices and a system for real-time incident response.
Learn more about the benefits of a cloud-based incident response tool in our latest guide, A DevOps Guide to Incident Response Software. We’ll cover not only the ins and outs of incident response software itself, but also some organizational practices that can make on-call suck less while also reducing MTTA/MTTR.