In the world of ITIL, ITSM and traditional IT operations, adopting agile infrastructure and collaborative workflows once seemed ludicrous. But now, in 2019, businesses of all sizes are adopting DevOps practices to improve service reliability and delivery speed. From startups to large enterprises, teams are learning how to improve organizational transparency and tighten the relationship between developers and IT operations.
The DevOps model allows teams to build resilient applications and infrastructure faster while simultaneously ensuring a better response to production incidents. Teams test continuously throughout the software delivery lifecycle (SDLC) and deploy frequently without incident. Then, when incidents do occur, the team is prepared to respond and quickly roll back a deployment or remediate the problem. DevOps and Agile work together by taking best practices from software developers and combining them with best practices from IT operations.
So, let’s discover what DevOps is and how both developers and IT teams are benefiting from the methodology.
DevOps isn’t a specific system or set of tools. The DevOps model is a process for improving collaboration and visibility between IT operations and software developers, creating a lifecycle for continuous integration and continuous delivery (CI/CD) of highly resilient systems. While many people think DevOps is about turning IT professionals into software developers, it actually works the other way too. Developers are learning more about the release management and incident management processes, giving them more ownership of the uptime and performance of their services.
The DevOps framework allows teams to build, test, release and maintain services quickly and reliably. Instead of going from one large deployment to another large deployment, engineering teams are able to quickly deploy smaller frequent changes and give customers value faster. In case of an incident, these smaller changes are easier to roll back and fix – leading to fewer reliability concerns and less downtime.
The core tenets of DevOps facilitate an organizational culture focused on continuous improvement. Effective DevOps practices continuously improve collaboration, transparency, exposure, accountability and automation throughout the entire process, making the whole lifecycle faster and more efficient. Because no two businesses are the same and every team is structured differently, there isn’t one specific way to implement DevOps. But if your DevOps implementation stays focused on those core values, you’ll be successful in reducing development time without hindering service resilience.
Why should Agile software development principles stop with developers? Why shouldn’t Agile practices also apply to the way IT operations approaches your service’s underlying infrastructure? Now, DevOps is mostly thought of as a human-centric approach to tightening feedback loops between developers and IT operations. But, it’s still important to acknowledge the changing technological landscape and how it’s affecting team productivity. Microservices, serverless applications and cloud-based infrastructure have led to the common use and implementation of Agile infrastructure.
Also known as dynamic infrastructure, Agile infrastructure is about supporting IT hardware and networks that can respond dynamically to changing internal and external factors in the system’s corresponding software. When traffic spikes, how can the infrastructure stretch to account for the changes and ensure optimal performance? DevOps teams are looking at their infrastructure as a tool for security, scalability and reliability – not a blocker. Infrastructure should be used as a method for improving the collaboration between people, processes and technology.
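To make the traffic-spike question concrete, here is a minimal Python sketch of how a dynamic infrastructure might decide how far to stretch. The function name, the per-replica capacity and the minimum and maximum bounds are all assumptions made for illustration; real autoscalers use their own rules and signals.

```python
import math

def desired_replicas(current_load, target_load_per_replica,
                     min_replicas=2, max_replicas=20):
    """Return how many replicas are needed to serve the current load.

    Hypothetical rule: give each replica a fixed share of the load,
    then clamp the result to the allowed range.
    """
    needed = math.ceil(current_load / target_load_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# A spike to 1,200 requests/sec with replicas sized for 100 requests/sec each.
print(desired_replicas(1200, 100))
```

The clamp keeps a floor of capacity for quiet periods and a ceiling that protects the budget during extreme spikes.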
But more dynamic infrastructure and a more efficient development process naturally lead to more change in the system. And with more change comes an increased likelihood of incidents. So, after improving the SDLC, DevOps teams need to look at the maturity of their incident management and how they can be more prepared when disaster strikes.
Mature incident management teams are able to quickly identify an incident and take action to remediate it. DevOps-centric businesses are better at resolving incidents without major interruptions to the development pipeline. Implementing collaborative DevOps workflows throughout the SDLC will inherently help move a team further along in their incident management maturity. IT teams get deeper exposure to development and staging environments while developers get more involved in production environments.
So, let’s take a look at the stages of incident management maturity and how the DevOps model leads to a more effective system for incident response and remediation.
Reactive incident management happens when developers and IT teams work in isolation from each other. There’s little to no visibility into system health and performance, and there’s no outlined process for incident detection and response. On-call responders typically don’t have defined roles or responsibilities throughout the incident lifecycle. The DevOps team isn’t prepared with all of the monitoring, alerting and collaboration tools they need – making the team highly reactive when an issue pops up.
At the tactical stage, the team has adopted some tools and laid out a basic template for monitoring, alerting and incident response. Personnel roles are defined and there’s a plan, albeit simple, for alert prioritization and routing. The team knows how they should communicate during a firefight and what other teammates are expecting during incident response. While tactical DevOps teams are still mostly reactive to incidents, they at least have a plan when something does happen.
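As a rough illustration of what a simple plan for alert prioritization and routing could look like, the Python sketch below maps an alert's severity to a notification target. The severity levels, target names and the `route_alert` function are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass

# Hypothetical routing table: severity -> who is notified and how.
ROUTING_RULES = {
    "critical": {"target": "primary-on-call", "channel": "page"},
    "warning": {"target": "team-chat", "channel": "chat"},
    "info": {"target": "ticket-queue", "channel": "ticket"},
}

@dataclass
class Alert:
    service: str
    severity: str
    message: str

def route_alert(alert: Alert) -> dict:
    """Return the routing decision for an alert.

    Unknown severities fall back to the critical rule so that
    nothing is silently dropped.
    """
    rule = ROUTING_RULES.get(alert.severity, ROUTING_RULES["critical"])
    return {"service": alert.service, **rule}

print(route_alert(Alert("checkout", "warning", "p95 latency above threshold")))
```

Even a table this small gives responders a shared answer to "who gets woken up for what," which is the core of the tactical stage.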
Where you really start to see value is at the integrated stage of the incident management maturity model. Teams in the integrated phase have started conducting post-incident reviews and analyzing past incidents to build resilience into their systems. Alerts coming from monitoring tools now have runbooks, triage documentation, traces, logs, charts and other context appended to them. There’s now an organized process for collaboration and alert routing, backed by a cross-functional DevOps team that’s prepared for incident response.
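Appending context to alerts can be as simple as a lookup performed before the alert reaches a responder. The following Python sketch assumes a hypothetical in-memory catalog keyed by service name; the paths are placeholders, and a real setup would pull this context from wherever the team keeps its runbooks and dashboards.

```python
# Hypothetical context catalog keyed by service name (placeholder paths).
CONTEXT_CATALOG = {
    "checkout": {
        "runbook": "runbooks/checkout-latency",
        "dashboard": "dashboards/checkout",
    },
}

def enrich_alert(alert: dict, catalog: dict = CONTEXT_CATALOG) -> dict:
    """Return a copy of the alert with any known context attached.

    Services without a catalog entry get an empty context rather
    than an error, so enrichment never blocks alert delivery.
    """
    context = catalog.get(alert.get("service"), {})
    return {**alert, "context": dict(context)}

alert = {"service": "checkout", "message": "p95 latency above threshold"}
print(enrich_alert(alert))
```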
And, last but not least, the holistic stage is a DevOps team’s ideal state for incident management. The team is proactively silencing alerts with automation and leveraging self-healing systems in order to reduce alert fatigue and prioritize critical incidents. The team tracks advanced incident management KPIs and metrics over time and uses automation and ChatOps to improve collaboration between people and systems.
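Which KPIs a team tracks will vary. As one example, the Python sketch below computes two commonly used ones, mean time to acknowledge (MTTA) and mean time to resolve (MTTR), from incident timestamps. The choice of these two metrics, the field names and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )

# Made-up sample incidents, all triggered at the same moment for simplicity.
t0 = datetime(2019, 1, 1, 12, 0)
incidents = [
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=4),
     "resolved": t0 + timedelta(minutes=30)},
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=6),
     "resolved": t0 + timedelta(minutes=50)},
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking these numbers over time, rather than as one-off snapshots, is what shows whether changes to process or tooling are actually helping.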
Defined communication methods and visibility throughout the entire SDLC lead to better workflows and on-call that doesn’t suck. At this stage, the DevOps team continuously improves through post-incident reviews, comprehensive metrics and a better understanding of how people, processes and technology need to work together.
In DevOps, everyone owns the incident management process and the software delivery process. Developers can’t simply throw code over the fence to IT operations teams and expect them to deploy it reliably. And QA and IT teams can now run continuous tests through staging and sandbox environments to ensure more consistency when shipping new services to production. If an incident happens during deployment, the IT team can easily roll the deployment back or escalate issues to on-call developers.
Resilient applications and infrastructure depend on everyone. All of engineering is responsible for the delivery of highly reliable services. Incidents are inevitable in the era of distributed systems, containers and hybrid cloud architecture. Alongside Agile development practices and the speed at which change now occurs, the DevOps model is the only way to truly build resilient engineering practices.