Sophicware manages cloud infrastructure on behalf of its customers and serves as an early warning system for customers and cloud providers alike. VictorOps helped cut the company’s time to resolution after a potential attack from hours to minutes. In the words of founder Jurnell Cockhren, here’s how they did it.
“VictorOps is on-call management with a focus on actually solving the problem.”
Let’s say it’s 1:30 am when I get an alert. I acknowledge the alert on my phone in a couple of seconds. The VictorOps timeline helps me quickly decipher the likely causes. With VictorOps, I know what happened, and I am well informed about the situation.
We tried two other on-call providers before, but we like the way VictorOps organizes and delivers information. That was important for us.
Smarter alerts for a “pristine” infrastructure
As a service provider that works across many customers and cloud providers, we have to make sure that our infrastructure works better than anyone else’s. Seventy percent of our use of VictorOps is to monitor our monitoring – to make sure our internal tooling is working and in a pristine state.
Our primary goal, and the reason we chose VictorOps, is to monitor recurring events and create systems that are mostly self-healing.
VictorOps provides an information stream where we can automatically insert context to jump-start a workflow to solve a problem. By annotating alerts, we attach as much information as we can to a specific alert, so we don’t have to repeat the manual steps.
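As a rough illustration of what annotating an alert can look like, here is a minimal Python sketch that attaches a runbook link and a note to an alert before it is sent. The `annotate_alert` helper, the example alert, and the wiki URL are hypothetical. The `vo_annotate.u.*` and `vo_annotate.s.*` key names are an assumption about how annotation fields are named, so check them against the VictorOps documentation before relying on them.

```python
import json


def annotate_alert(alert, runbook_url=None, note=None):
    """Return a copy of the alert with extra context fields attached.

    The annotation key names below are assumptions, not confirmed API.
    """
    annotated = dict(alert)  # copy so the original alert is left untouched
    if runbook_url:
        annotated["vo_annotate.u.Runbook"] = runbook_url
    if note:
        annotated["vo_annotate.s.Note"] = note
    return annotated


# Hypothetical alert for a stale long-running process.
alert = {
    "message_type": "CRITICAL",
    "entity_id": "indexer/stale",
    "state_message": "Indexer has not processed a document in 10 minutes",
}

enriched = annotate_alert(
    alert,
    runbook_url="https://wiki.example.com/runbooks/indexer",
    note="Restarting the indexer usually clears this.",
)
print(json.dumps(enriched, indent=2))
```

The point of the sketch is that the context travels with the alert, so whoever is on call sees the likely fix without looking it up by hand.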
Certain long-running pieces of software get stale, so we recently integrated VictorOps with Hubot. We can tell Hubot, “Something is going on with the indexer, so restart it automatically.” Right now, we are trying to find ways to feed more information into VictorOps and use the Transmogrifier feature to add even more context.
Because we monitor across many cloud providers, we often have early indicators when things are going wrong and can warn our clients and even the cloud providers themselves before systems are impacted.
VictorOps helps us get the alert and context as fast as possible. For example, we got an alert from VictorOps for an attempted login spike. I looked back at the last 15 minutes of activity, and there were 3,000 login attempts on the server. We were able to stop the attack and then talk to the cloud provider to ban certain IP addresses altogether. Incidents like this help us advise our clients on which practices to employ in the future.
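The kind of check behind a login spike alert can be sketched as a sliding-window counter. This is a minimal Python illustration, not Sophicware’s actual monitoring code: the `LoginSpikeDetector` class and the threshold of 1,000 attempts are hypothetical, and only the 15-minute window and the 3,000 attempts come from the incident described above. It assumes timestamps arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class LoginSpikeDetector:
    """Counts login attempts inside a sliding time window."""

    def __init__(self, window=timedelta(minutes=15), threshold=1000):
        self.window = window
        self.threshold = threshold  # hypothetical alerting threshold
        self.attempts = deque()     # timestamps, oldest first

    def record(self, timestamp):
        """Record one attempt; return True if the window holds
        at least `threshold` attempts."""
        self.attempts.append(timestamp)
        cutoff = timestamp - self.window
        # Drop attempts that have fallen out of the window.
        while self.attempts and self.attempts[0] < cutoff:
            self.attempts.popleft()
        return len(self.attempts) >= self.threshold


# Simulate 3,000 attempts spread evenly over just under 15 minutes.
detector = LoginSpikeDetector()
start = datetime(2024, 1, 1, 1, 15)
first_alert_index = None
for i in range(3000):
    spiking = detector.record(start + timedelta(milliseconds=300 * i))
    if spiking and first_alert_index is None:
        first_alert_index = i
```

In this simulation the detector first reports a spike on the 1,000th attempt, about five minutes in, which is the sort of early signal that lets a responder act while the attack is still under way.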
Every week I am grateful to use VictorOps because we are the first responders when attacks occur.
VictorOps is less about managing on-call (which it does very well) and more about acting on problems quickly.