Firefighting is a Team Sport

Organizing an on-call roster is tough. Aligning skills, experience, and availability with specific application technologies is difficult. In most cases you settle for “close enough” and hope smart people make good decisions. Skills and scheduling are really only the beginning; effective incident management requires focus on how your on-call team operates. I like to think of team dynamics on two dimensions. First, there is the structural organization of the team: people playing roles, workflows, and escalation paths. This is important to on-call teams because it impacts…

On-Call Ways and Means: A Developer’s Guide

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized: either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important. For those unfamiliar with Incident Management, there are some common misconceptions that fuel a fear of accepting on-call responsibilities. Chief among those are: – I’m going to be woken up for…

Introducing the VictorOps Integration Hub: Guaranteed to Make You Smile

We at VictorOps have announced a number of integrations over the years, each of which we have been thrilled to offer our customers. And our customers love these integrations as much as we do. In our line of business, technology integrations are key for busy, overwhelmed on-call teams who face a barrage of new alerts each day. So in our relentless effort to quiet alert noise and make being on-call suck less, we wanted to make seamless integrations easier than ever. Introducing the Integration Hub…

Feedback Request: The Third Annual State of On Call Survey

If you’re on call, or if you know people who know people on call, please participate in the 3rd annual State of On Call survey. Your contribution will help complete the picture of today’s on-call experience. Last year’s report included these takeaways: This year’s survey is a bit more strategic, and explores areas like these:
• Developers on call and other changing behavior
• Use of microservices and containers
• Evolution of the NOC
• The costs of downtime
• Incident management maturity indicators
• ITIL usage…

Success Stories for Engineers On-Call

Real-time monitoring and alerting are critical to maintaining the performance and security of your infrastructure. But, with today’s astounding access to data, it is important to use the right technology to manage alerting in a way that’s customized to your environment. If not, alert fatigue will take hold and your teams will lose their ability to respond to incidents quickly and effectively.


Alert Aggregation: How to Keep Heads from Exploding

The more I speak with people who deal with a flood of incoming alerts, the more I see why the traditional on-call role has such a high rate of burnout. People in the operations role are expected to monitor systems and maintain nearly 100% uptime. 99.9999% at the very least. If each monitoring system has its own fancy version of simple alerting, then in the spirit of not wanting to miss a beat, the person watching the systems receives simple alerts from a multitude of…

New Integration! Threat Stack and VictorOps

Hot off the webinar we presented together on modern-day security and incident management, our friends at Threat Stack built an integration with VictorOps. Now Threat Stack PRO level users can receive critical security alerts through the VictorOps platform.

How to set up the integration

From the VictorOps web portal, select Settings, then Integrations. Under Incoming Alerts, select the REST endpoint option. Copy the full Post URL to your clipboard. Now within Threat Stack, go to the main web portal. Select Settings, then Integrations. Choose the VictorOps…

Know Anyone With This High-Burnout Job?

Last night at three a.m., Dale was awake and sweating, frantically trying to fix a technical problem that broke his company’s online store. The shopping cart page had been down for four minutes, and many international customers were trying to place orders. How many of them gave up? After waking Andy for information about the latest system updates, Dale identified and fixed the problem. The site was back up by four fifteen a.m. Heart pounding, cortisol rushing through his body, Dale got back in bed…

Updated: The Buyer’s Guide to Modern Incident Management

It’s here! Check out the updated VictorOps Buyer’s Guide to Modern Incident Management. One of our most popular resources, this guide walks you through each step in the incident management lifecycle. Each step also includes a comprehensive list of questions to ask as you consider possible solutions. We updated the awesome original version with new content to reflect where the market is today, adding more graphics, education, and links to additional resources that go deeper into specific subject areas that might appeal to you. New: An Illustration of…