The Victor’s Blog

Dan Holloran April 16, 2018


Applications need to function around the clock. However, assuming your service will never experience downtime is unreasonable—no matter how in-depth your QA is. Interconnected systems naturally create a level of uncertainty that defines the meaning of on-call rotations. Establishing intelligent rotations means somebody is always available to address the problem,...

Read More »

Jonathan Schwietert April 13, 2018


We decided to embark on a journey to make our systems more reliable by creating intentional chaos. Our team developed the SRE Council, made up of engineers from different areas of the company, who would be tasked with creating chaos and improving the reliability of our services. To read the...

Read More »

Dan Holloran April 12, 2018


Resolving incidents quickly and mitigating downtime is essential when disaster strikes. However, the importance of the post-incident review can’t be overstated. Strategically planned and implemented post-incident reviews allow on-call engineers to manage incidents more effectively and minimize future ones. This post will act...

Read More »

Dan Hopkins April 10, 2018


VictorOps, like many startups, has gone through major growth in the last couple of years. New teammates, new customers, and a maturing organization have all demanded we continue to raise the level of our service, which, in turn, requires improving the reliability of our system. Starting at the beginning of 2017,...

Read More »

Dan Holloran April 06, 2018


In a previous post, we wrote about identifying potential issues, collecting applicable metrics, and defining smart alerts and escalation structures to create more robust systems. Companies will sometimes approach incident management reactively. But a lot of the work can be done proactively. Let’s get into the nitty...

Read More »

Amanda Boughey March 27, 2018


We recently held our first Chaos Day at VictorOps. Although we dove into this day with our eyes wide open, we still came across several unexpected behaviors. You’ll never be able to limit all anomalies from a Chaos Day, and you shouldn’t—chaos is the point. But, there are certain things...

Read More »

Maggie Gourlay March 26, 2018


The following blog post could be demonstrated in the real world with a short field trip to your local Ikea. Let me start by saying: I love Ikea. When the one in the Denver area opened, I waited in line to be one of the first in the door. Now,...

Read More »

Hannah Klemme March 23, 2018


SREcon Americas is already next week in Santa Clara, March 27-29. And, needless to say, we’re pumped! Here’s everything you need to know about the event, where to meet VictorOps IRL, and how to get your hands on the new DORA book, “Accelerate.” What Is SREcon? SREcon18 Americas is a...

Read More »

Dan Holloran March 21, 2018


Often, when an incident occurs, the system will notice and resolve it automatically. But engineers will still need to get involved on a number of occasions. Even if your system can’t fix the problem automatically when an application or system breaks, your monitored metrics will...

Read More »

Amanda Boughey March 16, 2018


When an incident occurs—regardless of the severity—you need your incident management checklist loaded to quickly and seamlessly handle the issue. Like most things, the best way to resolve incidents is to plan and prep ahead of time. Knowing what’s likely involved, at all severity levels, before an incident occurs ensures...

Read More »