Showing SRE Posts

Matthew Boeckman December 13, 2017

Focus on Readiness

The Readiness phase of the Incident Lifecycle is the time a team spends focused on learning about incidents, systems, and themselves. In Readiness, we move the Analysis phase forward into actionable steps to improve. This may include architecture and application changes, creating or updating runbooks, or Game Days. Game...

Read More »

Matthew Boeckman October 12, 2017

The Remediation phase of the Incident Lifecycle is the action-packed firefight. The main event. The diagnosis. The fix. The attempted fix. The workaround. In this phase we address whatever pattern created an incident in the Detection phase of the lifecycle. Teams who focus on reducing Mean Time To Repair...

Read More »

Jonathan Schwietert July 27, 2017

Part 3 of 3: Metrics and the Bigger Picture

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 3 in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed them, read Part 1 and...

Read More »

Jonathan Schwietert July 20, 2017

How We Made Logging Great Again

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is the second part in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed it,...

Read More »

Jonathan Schwietert July 13, 2017

Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management. These days, with infrastructures spanning tens, hundreds, even thousands of running...

Read More »

Matthew Boeckman June 13, 2017

I recently presented a webinar with DevOps.com about the behaviors we see in teams who represent the leading edge of Incident Management. Using the Incident Management Lifecycle as a jumping-off point, we explored 10 tips that nest into each of the 5 phases of an incident’s lifecycle. Depending on...

Read More »

Todd Vernon October 24, 2016

Last Friday the internet as a whole suffered an attack that exposed some of the issues surrounding a connective fabric that has literally come forth in most readers’ lifetimes. I guess, in retrospect, this should have been expected, but for a lot of companies it was the convergence of good...

Read More »

Jessica Kahn September 08, 2016

After his podcast interview about blameless postmortems with Richard Campbell of RunAsRadio, DevOps evangelist Jason Hand sat down with me to talk more about blame, the benefits of transparency, and the problems with root cause analysis. **JK: In the podcast, you discuss the concept of a blameless culture. What takeaways...

Read More »

Jason Hand February 19, 2016

We all remember the game from our childhood where one person whispers a phrase to the person directly next to them, who in turn shares the phrase with the following person in line. This continues through a group of people until it makes its way back to the original source....

Will La November 04, 2015

We started this blog series off by defining what an MVR is. Next up: what to include in your Minimum Viable Runbook. Develop your Incident Management Strategy - Lean Style When using the minimum viable approach to building your runbooks, you want to leverage digital automation to capture your...

Read More »