Titans Group develops and distributes value-added services in 17 countries through cable providers and mobile carriers. Its infrastructure team supports many development teams, many product lines, and a lot of alerts. Infrastructure and Operations Manager Caio Wendel sought a better way to monitor systems and alert the team, especially on nights and weekends. That’s when he found VictorOps…
“VictorOps is on-call management with a focus on actually solving the problem.”
“You don’t need to hire a NOC – just get VictorOps.”
I don’t want to pay someone to watch my systems all night while they play video games. With VictorOps, I don’t have to pay the night shift or overtime. I don’t need to have a person or a team sitting there monitoring our systems 7 days a week just to look at one string. VictorOps saves us money, and it’s much more efficient because it alerts the right person quickly.
Training the team to think beyond the quick fix
Before VictorOps, we relied on Nagios for our alerting. It was set up to send email alerts and in some cases SMS messages. But it sent so many messages that the alerts – even SMS messages – were sometimes ignored. We decided to bring on VictorOps for a more sophisticated alerting process that included phone calls. Now we integrate VictorOps with Nagios, Sensu, Amazon and Pingdom.
As soon as we incorporated phone calls to on-call team members, we drastically reduced false positives. Our team knew that if they just did a quick fix, they would be called again – in the middle of the night or on the weekend. Now people know that if the phone rings, they have to fix the problem once and for all.
We used to have problems that took 48 hours to resolve. Now they take minutes.
Sometimes the disk would be full on a Friday night and would go unnoticed until Monday. Now that we use VictorOps, that call goes out on a Friday night and the problem is fixed right away.
We’re based in Brazil, and it’s incredibly important for us to have international calling with no limits. We customized our escalation policies to call the on-call team member first, then call me if the alert hasn’t been acknowledged after 30 minutes, then call my boss after another 30 minutes. VictorOps can be very, very annoying when necessary by actually calling people, and that’s exactly what we wanted.
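For readers who want to see the shape of that escalation policy, here is a minimal Python sketch of the timing logic. It is an illustration only, not VictorOps configuration or its API: the names `EscalationStep`, `POLICY`, and `contacts_called` are hypothetical, and it assumes the first call goes out as soon as the alert fires.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationStep:
    """One step of a hypothetical escalation policy (not a VictorOps type)."""
    delay_minutes: int  # minutes after the alert fires, if still unacknowledged
    contact: str        # who receives the phone call at this step


# Assumed encoding of the policy described in the text: the on-call team
# member is called first, the manager after 30 unacknowledged minutes,
# and the manager's boss after another 30.
POLICY = [
    EscalationStep(0, "on-call team member"),
    EscalationStep(30, "manager"),
    EscalationStep(60, "manager's boss"),
]


def contacts_called(minutes_elapsed: int,
                    acknowledged_at: Optional[int] = None,
                    policy: List[EscalationStep] = POLICY) -> List[str]:
    """Return everyone phoned so far; acknowledging stops further escalation."""
    called = []
    for step in policy:
        if step.delay_minutes > minutes_elapsed:
            break  # this step's time has not come yet
        if acknowledged_at is not None and step.delay_minutes > 0 \
                and step.delay_minutes >= acknowledged_at:
            break  # alert was acknowledged before this step fired
        called.append(step.contact)
    return called


print(contacts_called(45))                      # ['on-call team member', 'manager']
print(contacts_called(90, acknowledged_at=10))  # ['on-call team member']
```

An alert left alone for 45 minutes has reached the manager, while one acknowledged after 10 minutes never goes past the on-call team member.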
VictorOps makes incident management surprisingly simple, and simple stuff usually works.