How you build your application absolutely impacts the lives of those in charge of supporting it. This isn’t a correlation we generally make, but thinking about what happens when things break as you build your application will help everyone. Developers should be thinking about ways they can improve incident management and response through code, especially because more and more developers are on-call. In this post, I’ll explore how bulkhead and sidecar design patterns can do just that.
When considering how your application can assist in incident response and remediation, here are the questions to ask of any design pattern:
1) Does this design pattern give more context?
2) Does this design pattern make resolution easier?
3) Does this design pattern improve my ability to communicate issues?
4) Does this design pattern improve my ability to bring in experts?
There are a lot of modern design patterns for microservices-based applications. The two I’m most interested in right now, from the perspective of how they help the people supporting an application, are bulkhead and sidecar.
The bulkhead pattern seeks to isolate applications and services into pools of resources. Such isolation allows some amount of failure to exist without bringing down the entire application or creating cascading issues. The bulkhead pattern has the inherent benefit of making services easier for the whole team to understand and decompose. It also provides the following benefits for supporting the application in production:
The isolation provides more context in the alert payload to better pinpoint the issue. It also allows responders to address issues in a way that does not impact the reliability and uptime of other functionality in the application.
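As a rough illustration of the isolation described above, here is a minimal sketch of a bulkhead in Python, assuming a semaphore-based design where each pool caps its own concurrent calls. The names `Bulkhead`, `BulkheadFullError`, and the `alert` fields are hypothetical and chosen only for this example; a real system would likely use its framework's or service mesh's own mechanism.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a pool has no free slots; carries context for an alert."""

    def __init__(self, pool: str, limit: int):
        super().__init__(f"bulkhead '{pool}' is full ({limit} concurrent calls)")
        # Hypothetical alert payload: the pool name pinpoints where the issue is.
        self.alert = {"pool": pool, "limit": limit, "reason": "bulkhead_full"}


class Bulkhead:
    """Caps concurrent calls for one pool so it cannot starve other pools."""

    def __init__(self, pool: str, limit: int):
        self.pool = pool
        self.limit = limit
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing: a saturated pool rejects new work
        # rather than letting callers pile up and spread the failure.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(self.pool, self.limit)
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency (say, `Bulkhead("payments", 10)` and `Bulkhead("search", 20)`), a saturated payments pool rejects its own calls while search keeps serving, and the rejection already names the pool that is in trouble.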
If a human needs to be in the loop, pools can map to teams, and teams generally serve as buckets for subject matter experts and alert destinations. So, the pools can help determine who or what to alert and who might be an expert to add as a responder. This is especially true when developers are expected to be on-call for their code. When there’s a failure, whoever is paged can use the pool to identify which currently on-call developers are relevant to the issue, rather than reaching out to anyone tagged as a backend or frontend developer.
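The pool-to-team idea can be sketched as a simple lookup. This is only an illustration under assumed data: the pool names, team names, on-call rosters, and the `responders_for` function are all hypothetical, and in practice the roster would come from your on-call scheduling tool rather than a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical ownership and roster data, for illustration only.
POOL_OWNERS: Dict[str, str] = {
    "payments": "team-payments",
    "search": "team-search",
}
ON_CALL: Dict[str, List[str]] = {
    "team-payments": ["dana"],
    "team-search": ["lee", "sam"],
    "team-platform": ["riley"],
}


def responders_for(alert: dict, fallback_team: str = "team-platform") -> dict:
    """Pick the owning team and its on-call people from the alert's pool."""
    # Alerts with a missing or unknown pool go to a catch-all team.
    team = POOL_OWNERS.get(alert.get("pool"), fallback_team)
    return {"team": team, "responders": ON_CALL.get(team, [])}
```

For example, `responders_for({"pool": "search"})` selects `team-search` and its two on-call developers, while an alert with no recognizable pool falls back to the platform team.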
When I imagine the sidecar pattern, I think of something a little more parasitic. This pattern is a great way to keep complementary components attached logically but separate technically. What this does for application support is:
It prevents service-related code from taking down the primary function of the service itself. This allows the service, or the service-related code, to be rolled back independently so that one does not impact the other. At the same time, it gives the service and its companion app a direct connection that makes it easier for each to consume I/O from the other. For example, perhaps a service has a platform abstraction layer as its entry point. This layer could be separated so that if it becomes unresponsive under high load, users of the service aren’t directly impacted.
Besides abstractions, the sidecar pattern is often used for a service’s monitoring tooling. The benefit is that issues with a monitoring tool cannot impact the functionality of the service it’s attached to. It also means incident responders don’t lose access to data from the monitoring tool if the service goes down. This is a large shift from most architectures, where even monitoring for microservices applications is done in a monolithic way. Monitoring tools also often have agents and can benefit from being tightly coupled with the application. You don’t want those agents to bring down the service if they have an issue. But you do want data to keep coming off the service and remain accessible for as long as the service can produce it, even if it’s not functioning.
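To make the monitoring case concrete, here is a minimal sketch of a sidecar that runs as its own process next to a service and polls it over local HTTP. It assumes the service exposes a health URL (the `/health` path, the `probe` and `run_sidecar` names, and the record fields are all hypothetical). The point it illustrates is that the probe never raises, so the sidecar keeps reporting even when the service is down, and a crash in the sidecar cannot take the service with it.

```python
import http.client
import json
import time
import urllib.error
import urllib.request

# A sidecar talks to its neighbor directly, so bypass any configured proxies.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 1.0) -> dict:
    """Check the service once and always return a record, never raise."""
    started = time.time()
    try:
        with _opener.open(url, timeout=timeout) as resp:
            status, detail = "up", f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        status, detail = "degraded", f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError, http.client.HTTPException) as exc:
        # Connection refused, timeout, or a broken response: report it anyway.
        status, detail = "down", str(exc)
    return {
        "service_url": url,
        "status": status,
        "detail": detail,
        "latency_s": round(time.time() - started, 3),
    }


def run_sidecar(url, emit=print, interval=5.0, iterations=None):
    """Emit one JSON record per check; runs forever when iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        emit(json.dumps(probe(url)))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

Started separately from the service, for instance with `run_sidecar("http://127.0.0.1:8080/health")`, this would print a record every five seconds. Swapping `emit` for a function that forwards records to your monitoring tool is left out here, since that API depends on the tool.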
And, like the bulkhead pattern, the isolation is both logical and technical. The logical benefit is that responders can stay focused on where the issue occurs and more easily bring in support where it’s needed, especially in cases where developers are on-call and need deeper access to monitoring tooling than would normally be the case for the broader application.
The list of modern design patterns is growing rapidly, and it’s hard to keep up with. Many other patterns relate directly to supporting the application and addressing failures automatically or manually. I’m a big fan of the sidecar and bulkhead patterns as tools to improve application production support.
Every architectural design and application development process needs to consider the likelihood of failure and the underlying incident response workflow.