Top Ten Practices of Highly Effective DevOps Incident Management Teams

I recently presented a webinar with DevOps.com about the behaviors we see in teams who represent the leading edge of Incident Management. Using the Incident Management Lifecycle as a jumping-off point, we explored 10 tips that nest into each of the 5 phases of an incident’s lifecycle. Depending on a team’s relative maturity, these ideas may represent anything from a starry-eyed daydream to an example of your normal operating practice. A recording of the presentation, polls, and Q&A can be viewed here. I’ll…

Choosing a Chatbot: From Hubot to Yetibot, What You Need to Know

If you haven’t picked up your copy of the O’Reilly book, ChatOps, by Jason Hand, then go get the free download for a comprehensive understanding of group chat. This excerpt on Chatbots gives great tips on the most common bots. From firing off an API call to resetting a server, chatbots are a way to trigger a set of automations using chat functionality. Let’s get started! It’s time for a chatbot when third-party integrations are: – Not available – Not flexible enough to work with your unique…

Reducing Alert Noise: Going from 1000 Alerts to 10 Alerts Overnight

Monitoring tools are great. Here at VictorOps, we are constantly rolling out new integrations with monitoring tools, and without them VictorOps wouldn’t have much to work with. They enable you to check system health every few minutes and often alert you in the same way: by sending an email or notification every time a check finds a failure. If you haven’t set up alert dependencies in your monitoring systems, this can become noisy. In cases where you have configured your monitoring systems to check system health every…
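The excerpt stops before the post describes its own approach, so what follows is only a minimal sketch of one common way to cut this kind of noise: notify on state changes instead of on every failed check. The names `AlertDeduplicator` and `handle` are hypothetical and are not part of VictorOps or any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One open problem for a (host, check) pair. Hypothetical structure."""
    host: str
    check: str
    first_seen: int
    failures: int = 1


class AlertDeduplicator:
    """Collapse repeated failures of the same check into one notification."""

    def __init__(self):
        # Open incidents, keyed by (host, check).
        self.open = {}

    def handle(self, host, check, ok, timestamp):
        """Return a message only when the state changes, otherwise None."""
        key = (host, check)
        incident = self.open.get(key)
        if not ok:
            if incident is None:
                # First failure: open an incident and notify once.
                self.open[key] = Incident(host, check, timestamp)
                return f"CRITICAL {host}/{check}"
            # Repeat failure: count it, but stay quiet.
            incident.failures += 1
            return None
        if incident is not None:
            # Check passed again: close the incident and notify once.
            del self.open[key]
            return f"RECOVERY {host}/{check} after {incident.failures} failed checks"
        return None


if __name__ == "__main__":
    dedup = AlertDeduplicator()
    # Simulate the same check failing on 1000 consecutive runs.
    messages = [dedup.handle("web01", "disk", False, t) for t in range(1000)]
    sent = [m for m in messages if m is not None]
    print(len(sent), "notification(s) sent for", len(messages), "failed checks")
```

In this sketch a check that fails on every run produces one notification when it first fails and one when it recovers, rather than one per run. Alert dependencies, which the excerpt mentions, are a separate technique and are not modeled here.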

Don’t Miss This Webinar: The Evolving Role of Context in Incident Management

Providing situational context to first responders is one of the most nuanced and critical success factors for teams as they manage and resolve incidents. It’s critical at all stages of incident management, from alert detection through postmortem. Provide no context, and you’ll materially impede resolution efforts. Overwhelm a team with data, and chaos ensues. Understanding the evolving role of context will differentiate your incident management abilities and prepare you for ongoing success. In this webinar, you’ll gain an understanding of the evolving role of situational…

Context through coupling: JIRA for on-call teams

Some of the most visible artifacts of organizational silos in engineering are tools. Visibility into dashboards, workflows, or documentation is cordoned off in separate and often redundant systems. Conway’s Law manifests in tool choices and integrations as much as in application development. As a team matures an Incident Management or DevOps practice, breaking these tool walls is necessary. In this post I’ll explore a low-effort, high-value way that you can extend integration between VictorOps and JIRA to break down those silos, and empower your…

Hiring Out Key Infrastructure: Is the Exit Clearly Marked?

Recent events on the Internet have produced a lot of headlines, and if you’re an Ops Manager, a lot of headaches. Yesterday’s AWS outage caused widespread issues across several industries, and many affected organizations are waking up today realizing they didn’t have a good way to respond, other than waiting for Amazon to identify and correct the issue. Outages happen to everyone; the key is knowing how to respond, and indeed knowing whether you can respond at all. Outsourced providers help achieve scale and redundancy…

Actionable Alerts

Today, I’d like to build on my earlier post, and expand on the third fear that faces folks joining an on-call rotation. In that post I outlined some of the basics around monitoring and alerting practices for developers or other non-traditional ops folks. This topic deserves its own post, as the idea has transformed how Incident Management works. I remember the first time I read the phrase “Actionable Alerts”, from this post at Ted Dziuba’s blog way back in March of 2011. If you haven’t…

On-Call Ways and Means: A Developer’s Guide

Bringing non-traditional Ops folks, including developers, on-call can be a tricky process. Initial reactions tend to be highly polarized: either total pushback and refusal, or a meek acceptance coupled with fear of the unknown. For the former, understanding the root of the refusal is useful. For the latter, providing clarity and training is important. For those unfamiliar with Incident Management, there are some common misconceptions that fuel a fear of accepting on-call responsibilities. Chief among those are: – I’m going to be woken up for…

Introduction: How to Build Internal Feedback into the Product Discovery Process

Aaron is the Director of User Experience at VictorOps. His mission is to solve real problems for real people. He is currently striving to improve quality of life for real people on-call. I recently wrote about how tradeshows are invaluable feedback-gathering experiences for product people. Now, let’s start a conversation about how to build internal feedback into the Product Discovery process. Remind me–what is Product Discovery? Product Discovery is all about identifying opportunities that serve and satisfy customers. The opportunities we’re referring to can range…

On-Call Handoffs: Empowering Adaptability in Incident Response

Managing on-call teams has always been a challenge in complex environments. With the continued adoption of Continuous Delivery, the challenges are squared. Now, not only do you have to manage a complex environment, but that environment is also changing dozens of times per day. On-call today has to be less about a strict execution of predefined procedures, and more about adaptability. Smart people, acting with good situational context, tend to make the best decisions. Those same smart people must be empowered with necessary skills and tools, but…