As technology iterates and advances, it becomes ever more distributed. Over the past several decades, architectures have evolved from centralized mainframes holding a single monolithic application accessed by dumb terminals to microservices running in containers on public clouds, accessed by user interfaces running on handheld mobile devices.
The tricky thing about this radical change is that tooling doesn’t always move as fast as architectures. In other words, distributed systems have been the norm for a long time, but the tools necessary to manage and maintain distributed architectures haven’t evolved as quickly.
That’s finally changing, as new tools have hit the market in the last several years that are prepared to handle the challenges of microservices architectures. They provide one essential feature – context awareness – which makes all the difference when diagnosing incidents.
Let me explain why…
With any monolithic application, there are predetermined paths through the business logic, so if there is an error in a single subsystem, the application support teams know exactly what’s calling that subsystem and what data is always passed along.
Running an application built on the microservices architectural model can lead to the other extreme. When an error occurs in one microservice and you have no idea what called it or what the call was trying to accomplish, it is very difficult to know whether the request was even valid, let alone reproduce the scenario that caused the error.
Having even basic context available, such as the source service, the parameters that were passed, and where in the service the error occurred, puts support staff well on their way to faster incident resolution.
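To make that concrete, here is a minimal sketch of a service capturing those three pieces of context when a call fails. The service name, the `reserve_stock` function, and the field names are all hypothetical, chosen only for illustration.

```python
import json
import traceback


def build_error_context(source_service, params, exc):
    """Collect basic context for one failed call."""
    # The last traceback frame is where the exception was actually raised.
    frame = traceback.extract_tb(exc.__traceback__)[-1]
    return {
        "source_service": source_service,            # who called us
        "params": params,                            # what they passed
        "error": f"{type(exc).__name__}: {exc}",     # what went wrong
        "location": f"{frame.name}:{frame.lineno}",  # where it failed
    }


def reserve_stock(sku, quantity):
    # Hypothetical business logic that rejects invalid input.
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"sku": sku, "reserved": quantity}


def handle_request(source_service, params):
    try:
        return reserve_stock(**params)
    except Exception as exc:
        context = build_error_context(source_service, params, exc)
        print(json.dumps(context))  # one line a log collector can index
        return context
```

Calling `handle_request("checkout", {"sku": "A-1", "quantity": 0})` emits a single JSON line naming the caller, the parameters, and the function where the error was raised, which is usually enough to judge whether the request was valid in the first place.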
Microservices bring a whole new level of scale to application deployments. But containers running under orchestration engines like Kubernetes or Docker Swarm can make finding context far more difficult, or easier, depending on your point of view.
It's no longer just the sheer number of microservices that make up an application. You also need to be aware of the information inside each container, which can run on any of a number of hosts in the cluster (almost randomly). Containers are destroyed as soon as they hit a critical error or are shut down, so there's no going back to find the logs in any sort of timely manner.
Ideally, there would be a single tool to implement and it would do everything – but that just isn’t the case. Realistically, as companies move to microservices-based applications running in a containerized world, they need to invest in supplementary tools. Some DevOps tools on the market meet multiple needs in a single product suite but I have yet to find one vendor that addresses them all.
Whether a SaaS offering or on-premises, you’ll need a single system that can reach into your applications and systems to pull logs in near real-time. This allows logs to not only be available after containers are destroyed but also to be filtered and analyzed for errors before they cause actual customer-impacting outages.
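On the application side, one common way to make logs easy to collect is to write them as one JSON object per line to a stream that a log shipper reads. The sketch below uses only Python's standard `logging` module. The field names and the `make_logger` helper are assumptions for illustration, not the format of any particular logging product.

```python
import io
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            # Inside a container the hostname is often the container ID
            # or pod name, which ties the line to a short-lived instance.
            "host": socket.gethostname(),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def make_logger(service, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

In a container you would typically pass `sys.stdout` as the stream, for example `make_logger("inventory", sys.stdout).error("stock lookup failed")`, and let the collector ship each line off the host before the container goes away.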
Deep diagnostics are available from open-source projects like Jaeger, or from one of many APM vendors. These diagnostics record what is going on inside an application, and allow support staff to view what exactly an application was doing when an incident occurred, and potentially see the actual transaction that had the error.
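The core idea behind these tools can be sketched in a few lines: every request carries a trace ID from service to service, and each service records a timed span against it. The code below is a simplified illustration using only the standard library. It is not Jaeger's API, and the `X-Trace-Id` header name and the in-memory `SPANS` list are stand-ins I made up for the example.

```python
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch
SPANS = []                   # stand-in for a tracing backend


def traced(service, operation, headers, work):
    """Run `work` and record a span tied to the caller's trace ID."""
    # Reuse the incoming trace ID, or start a new trace at the edge.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    outgoing = {**headers, TRACE_HEADER: trace_id}
    start = time.monotonic()
    error = None
    try:
        # Downstream calls receive the headers that carry the trace ID.
        return work(outgoing)
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "service": service,
            "operation": operation,
            "duration_s": time.monotonic() - start,
            "error": error,
        })
```

If a `checkout` handler wraps its call to an `inventory` handler in `traced(...)` and passes the headers along, both spans end up with the same `trace_id`, so support staff can pull up the whole failed transaction rather than one isolated error.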
Keep track of end users with real user monitoring to watch for the response times and errors that real users are seeing. This is best suited to watching trends, but it can also be valuable for real-time incident identification and response.
Monitor system metrics, including CPU, memory, disk, and network traffic. Prometheus, which like Kubernetes is a Cloud Native Computing Foundation project, is a common choice, but there are multiple other offerings that can fill the same need.
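As a rough illustration of what a metrics scrape looks like, the sketch below renders a disk-usage gauge in the Prometheus text exposition format using only the standard library. The metric name `demo_disk_used_bytes` is hypothetical, and the helper skips label-value escaping, so treat it as a sketch rather than a replacement for an official client library.

```python
import shutil


def render_gauge(name, help_text, value, labels=None):
    """Format one gauge in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        # Sorted for stable output; label values are not escaped here.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def disk_metrics(path="/"):
    """Report bytes used on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return render_gauge(
        "demo_disk_used_bytes",  # hypothetical metric name
        "Bytes used on the filesystem.",
        usage.used,
        {"path": path},
    )
```

A service would serve text like this from an HTTP endpoint for the monitoring system to scrape at a regular interval.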
A centralized communication channel gives everyone one place to discuss an incident and review its data. Tools like Slack are incredibly popular because they have integration hooks to pull data not only from incident management systems but from APM, monitoring, and centralized logging as well.
A centralized incident management solution will help show which assets are related to which products so the right teams are alerted, and everyone has the information delivered to them in a timely manner. Nothing is worse than having to remember who supports a system and then tracking down his or her phone number at 11:46 pm – 14 minutes before the start of Black Friday.
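The routing idea is simple enough to sketch: map each asset to the product it belongs to, and each product to whoever is on call. Everything in the example below (asset names, teams, addresses) is made up for illustration. A real incident management tool would hold this data along with schedules and escalation policies.

```python
from dataclasses import dataclass

# Hypothetical ownership data: asset -> product -> on-call contact.
ASSET_TO_PRODUCT = {
    "inventory-svc": "storefront",
    "ledger-db": "billing",
}
PRODUCT_ON_CALL = {
    "storefront": "team-web@example.com",
    "billing": "team-finance@example.com",
}
FALLBACK = "ops-duty@example.com"  # used when nobody owns the asset


@dataclass
class Alert:
    asset: str
    summary: str


def route(alert):
    """Return the contact who should be paged for this alert."""
    product = ASSET_TO_PRODUCT.get(alert.asset)
    return PRODUCT_ON_CALL.get(product, FALLBACK)
```

With this in place, `route(Alert("ledger-db", "disk full"))` resolves to the billing team's contact without anyone having to remember who supports the database.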
Supporting ITSM incident response processes or procedures in any modern application environment becomes easier when you can streamline information collection and collaboration through a centralized incident management solution. These solutions allow support staff to retrieve relevant information quickly, communicate with the required teams and build the context they need to immediately start resolving the incident, which will improve overall response times and closure rates.
Sign up for a 14-day free trial or request a free personalized demo of VictorOps to see exactly how you can surface context faster, collaborate better and make on-call suck less with a centralized incident management solution.
Vince Power is a Solution Architect who has a focus on cloud adoption and technology implementations using open source-based technologies. He has extensive experience with core computing and networking (IaaS), identity and access management (IAM), application platforms (PaaS), and continuous delivery.