If you’ve worked in Operations at any point in the past couple of decades, and especially if you work with open-source technology, then Nagios will need no introduction. I’ve personally been relying on Nagios to keep me employed ever since it was called “NetSaint” in the late ’90s, and it has grown and evolved by leaps and bounds since those early days. When we were starting out at VictorOps, we knew that Nagios was going to be one of our most used and most important integrations, so it was the very first one that we shipped.
A good open-source project makes it easy for users to extend the functionality of the software, and this is one of the places where Nagios really shines. The software allows you to use custom code for local and remote service checks, notifications, event handlers, and more. Our Nagios integration features a plugin that fits right into this framework, providing two-way integration with the VictorOps platform in a way that’s broadly compatible with different versions of Nagios.
Virtually all alert management platforms integrate with Nagios, but some expect you to leave your Nagios dashboard behind and rely exclusively on their GUI for an up-to-date view of your platform. We went out of our way to work within the Nagios ecosystem and to make sure that the view you see from the VictorOps portal and the view you see from Nagios are consistent.
When you install the VictorOps Nagios plugin, several resources get defined in the victorops.cfg file. A contact called “VictorOps” becomes the specific target for your alerts in Nagios. The “notify_victorops” command definition provides the method for queueing alerts to the VictorOps platform.
The plugin defines several services in Nagios as well, and installs a process to forward alerts to our platform. The service checks ensure that the forwarder is running, send a heartbeat to the VictorOps platform, and poll VictorOps for commands to be passed back to Nagios.
All of these moving parts provide important functionality:
– The forwarder can detect when it is unable to reach VictorOps, and will send alerts via “failsafe” email if that happens.
– The heartbeat service checks in every few minutes and lets us know that your Nagios server is running and connecting to our platform.
– The command poll service allows us to send acknowledgements back to Nagios when you acknowledge an incident from VictorOps. This is one way we ensure a consistent view between Nagios and the VictorOps portal.
– Finally, a manually triggered “status resync” service will send the status of all your hosts and services from Nagios to the VictorOps platform. This is a good way to get back in sync if your Nagios instance was offline during an alert-generating event.
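To make two of these pieces more concrete, here is a rough Python sketch of the failsafe fallback and of turning an acknowledgement into something Nagios understands. It is only an illustration under stated assumptions, not the plugin's actual code: the function names, the alert dictionary, the retry count, and the flag values are all hypothetical. The one part taken from Nagios itself is the external command format (ACKNOWLEDGE_HOST_PROBLEM and ACKNOWLEDGE_SVC_PROBLEM).

```python
import time
from typing import Callable, Optional

# Hypothetical alert shape; the real plugin's format is not shown in this post.
Alert = dict


def forward_alert(
    alert: Alert,
    send_to_platform: Callable[[Alert], None],
    send_failsafe_email: Callable[[Alert], None],
    retries: int = 3,  # assumed value, chosen for illustration
) -> str:
    """Try the platform first; fall back to email if every attempt fails."""
    for _ in range(retries):
        try:
            send_to_platform(alert)
            return "platform"
        except OSError:
            # Connection failures are OSError subclasses in Python.
            continue
    send_failsafe_email(alert)
    return "failsafe_email"


def format_ack_command(
    host: str,
    service: Optional[str],
    author: str,
    comment: str,
    now: Optional[int] = None,
) -> str:
    """Build a Nagios external command line that acknowledges a problem."""
    ts = int(time.time()) if now is None else now
    # The sticky, notify, and persistent flags are fixed at 1 for brevity.
    if service is None:
        return f"[{ts}] ACKNOWLEDGE_HOST_PROBLEM;{host};1;1;1;{author};{comment}"
    return f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};1;1;1;{author};{comment}"
```

In a setup like this, the string returned by format_ack_command would be written to the Nagios external command file, and the two sender callables would wrap whatever HTTP and mail delivery you actually use.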
All of this functionality requires only a few edits in a config file to work. As sophisticated as the plugin is, we took pains to make it easy to install and set up. An experienced Nagios admin will be able to get it running in a few minutes by following our documentation. We even have features for organizations that run several Nagios servers, so you can tell which server produced an alert in your timeline.
[At VictorOps, we use Puppet to manage all of our infrastructure, including our Nagios servers. I was honored to speak at the 2014 Nagios World Conference about how we use Puppet to make sure that every host we build gets the right monitoring as soon as it’s provisioned. You can check out the presentation at the link above.]