This was provided by Dave North, Director of DevOps at Signiant. He credits VictorOps with improving his team’s workflow and offered to write up how they’re using our service by integrating with their existing tools. Operationally, they run components of their SaaS offering in both AWS and Azure and use a variety of systems to deploy to, and monitor, the various service components. Read on to see why they call VictorOps their savior…
Prior to VictorOps
Prior to implementing VictorOps, we received alerts primarily from a Nagios deployment via email notifications. As we added more services, though, we found we were also receiving alerts from Papertrail (our log aggregation service), AWS CloudWatch and Pingdom. Routing these alerts to the right people was an almost impossible task, especially as we added more members to our team.
Enter our Savior – VictorOps
After we evaluated VictorOps (VO) and a couple of other on-call management solutions, it quickly became a “no-brainer” that VO was the right solution for us. The timeline was one of those “why is no one else doing this?” features: it seems so obvious, yet only VO had implemented it in a useful way. Working with the VO technical team, we quickly moved all our alerts over to VO, set up some on-call schedules and haven’t looked back.
So, what have we integrated?
We used the standard VO Nagios integration to hook into our Nagios servers, which is where 80% of our alerts originate; this made it a straightforward job to funnel alerts to VO and route them accordingly. We used the standard integrations to pull in alerts from AWS CloudWatch and Pingdom.com, and the standard Slack integration to create a “war room” channel that integrates bidirectionally with VO.
The more interesting work came when we wanted to build custom integrations or extend existing ones. On the custom side, we created two main integrations that we use every single day:
A custom PHP module (which we have open sourced) that can be integrated into any PHP application to send notifications to VO
In our case, it is integrated with the main tool we use to direct traffic to a live or standby environment. VO receives an INFO-level notification whenever someone changes the traffic routing between environments. This shows on the VO timeline, so if something later triggers a page, the on-call engineer can see in the timeline that something was changed beforehand.
A simple Jenkins plugin that integrates with our code promotion (release) process
This notifies VO whenever someone releases new code to production and, like the PHP module, lets us see on the timeline, at INFO level, when new code was released.
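Both custom integrations boil down to posting an INFO-level message to VictorOps whenever something changes. The post doesn’t show the open-sourced PHP module or the Jenkins plugin, so what follows is a Python sketch of the idea, not their code. It assumes the URL format of the VictorOps REST integration endpoint and its `message_type`/`entity_id`/`state_message` fields; the keys, entity IDs, user names and build number are made-up placeholders. It builds the request without sending it.

```python
import json
import urllib.request

# Assumed URL format of the VictorOps REST integration endpoint; the API key
# and routing key used below are placeholders.
REST_ENDPOINT = ("https://alert.victorops.com/integrations/generic/20131114/"
                 "alert/{api_key}/{routing_key}")


def build_change_notification(api_key, routing_key, entity_id, summary):
    """Build (but do not send) an INFO-level change notification for the timeline."""
    payload = {
        "message_type": "INFO",          # informational: a timeline record, not a page
        "entity_id": entity_id,          # hypothetical identifier for the kind of change
        "entity_display_name": summary,
        "state_message": summary,
    }
    return urllib.request.Request(
        REST_ENDPOINT.format(api_key=api_key, routing_key=routing_key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A traffic switch, as the PHP module might report it:
routing_req = build_change_notification(
    "REST_API_KEY", "ops", "traffic-routing",
    "alice moved traffic from live to standby")

# A production promotion, as the Jenkins plugin might report it:
release_req = build_change_notification(
    "REST_API_KEY", "ops", "release",
    "bob promoted build 118 to production")

# urllib.request.urlopen(routing_req) would actually send the notification.
```

Keeping these messages at INFO level is what makes them useful context rather than noise: they sit on the timeline next to any later page without waking anyone up themselves.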
Automatic addition of CloudWatch alarms for Amazon ECS services
Here, we created a standard AWS CloudFormation template for deploying microservices to the EC2 Container Service (ECS). Whenever a new service is added, the template automatically creates alarms on CPU and memory utilization that carry the correct VictorOps routing key. This way, the teams creating the services don’t have to do anything to hook into our alerting system (VictorOps) and can focus on building their service.
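The post doesn’t include the template itself. As a rough sketch of the idea, the Python below generates the CloudFormation resources such a template might contain: an SNS topic with an HTTPS subscription to the VictorOps CloudWatch integration endpoint (URL format assumed), plus CPU and memory alarms on the service’s `AWS/ECS` metrics that notify that topic. The threshold, period and all names and keys are hypothetical. In a real template the service name, cluster and routing key would more likely be template parameters than values filled in by a script.

```python
import json

# Assumed URL format of the VictorOps CloudWatch integration endpoint, which an
# SNS topic can call through an HTTPS subscription.
VO_CLOUDWATCH_URL = ("https://alert.victorops.com/integrations/cloudwatch/20131130/"
                     "alert/{api_key}/{routing_key}")


def ecs_alarm_resources(service_name, cluster_name, routing_key, api_key, threshold=80):
    """Return CloudFormation resources: an SNS topic that forwards to VictorOps,
    plus CPU and memory alarms for one ECS service that notify it."""
    # CloudFormation logical IDs must be alphanumeric, e.g. "orders-api" -> "OrdersApi".
    prefix = "".join(ch for ch in service_name.title() if ch.isalnum())
    topic_id = f"{prefix}AlertTopic"
    resources = {
        topic_id: {
            "Type": "AWS::SNS::Topic",
            "Properties": {
                "Subscription": [{
                    "Protocol": "https",
                    "Endpoint": VO_CLOUDWATCH_URL.format(api_key=api_key,
                                                         routing_key=routing_key),
                }],
            },
        },
    }
    for metric in ("CPUUtilization", "MemoryUtilization"):
        resources[f"{prefix}{metric}Alarm"] = {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": f"{service_name}: {metric} above {threshold}%",
                "Namespace": "AWS/ECS",
                "MetricName": metric,
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster_name},
                    {"Name": "ServiceName", "Value": service_name},
                ],
                "Statistic": "Average",
                "Period": 300,               # 5-minute periods (assumed)
                "EvaluationPeriods": 2,      # alarm after two consecutive breaching periods
                "Threshold": threshold,
                "ComparisonOperator": "GreaterThanThreshold",
                "AlarmActions": [{"Ref": topic_id}],  # Ref on a topic yields its ARN
                "OKActions": [{"Ref": topic_id}],     # also report recovery
            },
        }
    return resources


template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": ecs_alarm_resources("orders-api", "prod-cluster",
                                     "orders-team", "CW_API_KEY"),
}
print(json.dumps(template, indent=2))
```

Passing a different `routing_key` per service is what lets each team’s alarms land with the right people without any extra setup on their part.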
Apart from the many integrations VO provides, one of its key features is the Transmogrifier. This lets us adjust alerts dynamically as they come into VO and re-route them or, more interestingly, annotate them. In our case, we have rules that automatically annotate alerts with 3 main fields:
A link to the AWS or Azure status page
An icon that dynamically changes colour if a cloud vendor has reported an issue on its status page (i.e., it turns red if there is a problem with the underlying infrastructure provider). The issue may not be related to the alert, but it at least tells us that something may be happening upstream.
A direct link to the Papertrail logs for the service that is generating the alert, which makes it easy for us to get the logs for “this” service.
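Transmogrifier rules are configured in VictorOps itself rather than written as code, so the Python below only models what the three annotations above do to an incoming alert. The alert field names (`service`, `cloud`), the status-page URLs, the red/green colour values and the Papertrail search-link format are all assumptions for illustration.

```python
from urllib.parse import quote_plus

# Illustrative vendor status pages (verify the current URLs before relying on them).
STATUS_PAGES = {
    "aws": "https://status.aws.amazon.com/",
    "azure": "https://status.azure.com/",
}
# Hypothetical Papertrail search-link format.
PAPERTRAIL_SEARCH = "https://papertrailapp.com/search?q={query}"


def annotate(alert, vendor_has_issue):
    """Return a copy of `alert` with the three annotations added.

    `vendor_has_issue` maps a cloud name to True when that vendor's status
    page currently reports a problem."""
    cloud = alert["cloud"]
    annotated = dict(alert)  # leave the incoming alert untouched
    annotated["annotations"] = {
        "status_page": STATUS_PAGES[cloud],
        # Red flags a possible upstream problem; it need not be related to this alert.
        "status_icon": "red" if vendor_has_issue.get(cloud, False) else "green",
        "papertrail_logs": PAPERTRAIL_SEARCH.format(query=quote_plus(alert["service"])),
    }
    return annotated


alert = {"service": "upload-service", "cloud": "azure", "message_type": "CRITICAL"}
print(annotate(alert, {"azure": True})["annotations"])
```

The point of doing this centrally is that none of the alert sources needs to know about status pages or log links; every alert picks them up on the way in.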
We’ve also used the Transmogrifier to re-route some alerts, which is handy because we haven’t had to go back to the alert sources and reconfigure them.
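A re-routing rule amounts to “if a field of the incoming alert matches a pattern, send it to a different routing key”, applied inside VictorOps so the sources keep sending exactly what they did before. The sketch below models that logic; the rule list, the `monitoring_tool` field and the routing keys are hypothetical, and it applies the first matching rule, which may not be how VictorOps orders its own rules.

```python
import re

# Hypothetical rules: (alert field, pattern, new routing key). The first match wins.
REROUTE_RULES = [
    ("entity_id", re.compile(r"^billing-"), "billing-team"),
    ("monitoring_tool", re.compile(r"^pingdom$", re.IGNORECASE), "web-team"),
]


def reroute(alert, rules=REROUTE_RULES):
    """Return the alert with its routing key replaced by the first matching rule,
    or the alert unchanged if no rule matches."""
    for field, pattern, routing_key in rules:
        if pattern.search(str(alert.get(field, ""))):
            return {**alert, "routing_key": routing_key}
    return alert


print(reroute({"entity_id": "billing-db-1", "routing_key": "ops"})["routing_key"])
# prints billing-team
```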
After using VO for a few months in our DevOps (ops) team, we were so impressed that we recommended it to our customer-facing support team, who then also adopted VO as their on-call management solution.