Chris Riley · July 01, 2019 · Monitoring & Alerting
Exception monitoring and incident management are both examples of jargony terms that don’t mean what you might assume when you first hear them. Unless you know what an exception is in the context of IT and application development, your guess as to the definition of exception monitoring is probably way off. Similarly, incident management sounds like it could mean any number of things.
To clear up the confusion, keep reading for a primer on the meaning of exception monitoring and incident management, as well as a discussion of how they compare to each other.
To understand what exception monitoring means, you have to know the definition of an exception. In this context, an exception is any type of event that causes an application to fail to execute as expected. Exceptions can be caused by problems like an I/O error, a lack of memory or disk space, or an error within your application code.
Exceptions don’t necessarily cause an application to fail entirely. In many cases, applications are able to recover from exceptions. However, when an exception occurs, it means something, somewhere, went wrong within your application. If you don’t determine what it was and take steps to address it, you run the risk of having the problem escalate, leading eventually to a full application or system failure.
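To make the idea of a recoverable exception concrete, here is a minimal Python sketch. The function name `read_setting`, the logger name, and the fallback behavior are all hypothetical and chosen only for illustration. The point is that the application keeps running after the I/O error, but the exception is still recorded so someone can find out what went wrong.

```python
import logging

logger = logging.getLogger("app")  # hypothetical logger name


def read_setting(path, default):
    """Return the contents of a settings file, or a default if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        # An I/O error is an exception: the read did not go as expected.
        # Record it with its traceback so it can be investigated later,
        # then recover by falling back to the default value.
        logger.exception("Could not read %s; using default", path)
        return default
```

Calling `read_setting` with a path that does not exist returns the default rather than crashing. Without the logging call, the application would recover just as well, but nobody would ever learn that the problem happened.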
Exception monitoring, then, is the process of identifying exceptions within an application. More broadly, it also refers to the process of analyzing and addressing those exceptions. You might think a term like exception management would make more sense, since exception monitoring is really about more than just monitoring for exceptions. But that’s not the term people use.
Now we know what an exception is. Let’s define an incident.
In the context of incident management, an incident is any type of failure within an overall IT system that causes the system to stop responding or performing adequately. Incidents could be the result of an application problem (such as a memory leak that causes the host system to run out of free memory), an infrastructure issue (such as a disk failure or a server crash), or a combination of the two.
Incident management refers to the art and science of identifying and responding to these incidents. It also entails identifying problems before they actually turn into full-blown incidents that cause a disruption to an application service.
Exception monitoring and incident management share several key attributes. Both are important for maintaining positive user experiences. Both help to identify inefficiencies within applications or the environments that host them, and fixing those inefficiencies can reduce costs (especially in a cloud-based environment, where the more resources you consume, the more you pay). And both are essential for enabling continuous improvement.
That said, exception monitoring and incident management are fundamentally different disciplines. While exception monitoring focuses on identifying and fixing problems within an application’s internal flow, incident management deals with problems within the larger stack of software and hardware in which an application lives.
In addition, the specific tools and types of expertise required to perform exception monitoring are usually different from those that you’d use for incident management. For the latter, you can collect a variety of metrics from across the stack, such as response time, I/O rates and network load. Correcting the incidents that you find doesn’t always (or even often) require you to look deep inside the application code. Much of the time, effective incident response entails simply modifying a configuration file or making more infrastructure available.
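As a rough illustration of the metrics-driven side of incident management, the sketch below compares one sample of metrics against fixed limits. The metric names, the limit values, and the `find_breaches` helper are assumptions made up for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # value above which the metric counts as a breach


def find_breaches(sample, thresholds):
    """Return the names of metrics whose sampled value exceeds its limit."""
    return [t.metric for t in thresholds
            if sample.get(t.metric, 0.0) > t.limit]


# Illustrative limits only; real values depend on the system being watched.
thresholds = [
    Threshold("response_time_ms", 500.0),
    Threshold("disk_io_ops_per_s", 2000.0),
    Threshold("network_load_pct", 90.0),
]

sample = {
    "response_time_ms": 820.0,
    "disk_io_ops_per_s": 350.0,
    "network_load_pct": 40.0,
}

breaches = find_breaches(sample, thresholds)
print(breaches)  # ['response_time_ms']
```

Nothing here looks inside the application code. A slow response time is flagged from the outside, and the fix might be as simple as adding capacity.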
For exception monitoring, in contrast, your main data sources are typically log files that record information about aberrant application behavior. Interpreting that information usually requires admins to have at least some understanding of application code, since it’s within the application that changes need to be made to respond to exceptions.
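A simple way to picture log-based exception monitoring is a script that scans log lines and tallies exceptions by type. The log format, the sample lines, and the `count_exceptions` helper below are assumptions for illustration. Real applications log in many different formats.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <ExceptionType>: <message>"
ERROR_LINE = re.compile(r"^\S+ ERROR (?P<kind>\w+): (?P<message>.*)$")


def count_exceptions(lines):
    """Count logged exceptions by type, ignoring lines that are not errors."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("kind")] += 1
    return counts


# Made-up sample data in the assumed format.
sample_log = [
    "2019-07-01T10:00:00 INFO request served",
    "2019-07-01T10:00:01 ERROR MemoryError: allocation failed",
    "2019-07-01T10:00:02 ERROR OSError: disk full",
    "2019-07-01T10:00:03 ERROR OSError: disk full",
]

counts = count_exceptions(sample_log)
print(counts.most_common(1))  # [('OSError', 2)]
```

The tally shows which exception is most frequent, but deciding what to change in response still requires reading the code that raised it.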
It’s worth noting that exception monitoring and incident management are both different from software testing, which evaluates whether an application performs or behaves as expected before the application is deployed into production. Software testing is another discipline in and of itself.
Exception monitoring and incident management are both critical parts of a healthy software delivery process. You can’t substitute one for the other. Nor can you (in most cases) rely on the same exact tools to meet both types of needs.
Chris Riley (@HoardingInfo) is a technologist who has spent 15 years helping organizations transition from traditional development practices to a modern set of culture, processes and tooling. In addition to being an industry analyst, he is a regular author, speaker, and evangelist in the areas of DevOps, Big Data, and IT. Chris believes the biggest challenges faced in the tech market are not tools, but rather people and planning.