From Scala Unified Logging to Full System Observability
Part 2 of 3: How We Made Logging Great Again

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is the second part in a series on system visibility, the Detection and Analysis part of the Incident Management Lifecycle. If you missed it, read Part 1 first.

Unified Interface

Like any good engineering approach, we wanted a simple, thin facade that would codify our use-cases, provide a single extension point, and be low-overhead to maintain. With the exception of appender performance, we needed unification of the interfaces in order to…

From Scala Unified Logging to Full System Observability
Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management. These days, with infrastructures spanning tens, hundreds, even thousands of running instances, piping a log file into less is no longer an acceptable means of log research and debugging. Instead, sending logs to an aggregation service with products like Sumo, Elastic, or Splunk is commonplace because searchability is king. Unfortunately, the pursuit of searchability…

Greg Frank’s Smart Home Experiment:
How one engineer/dad protected his home from flood, fire, and incomplete homework assignments.

Greg Frank is a VictorOps engineer on the plat-frastructure team. An IoT aficionado, Greg built a smart home system using SmartThings and VictorOps. Over the years, I’ve had the misfortune of experiencing some small water leaks at my house. These were minor events: a leaky ice maker and a water valve for a sink. But among my friends and relatives, I have seen and assisted with cleanup for more severe water leaks that involved great expense and effort.

Configuring a smart home

My instinct as…

Focus on Detection:
Prometheus, and the case for time series analysis

Detection, in the Incident Lifecycle, is the observation of a metric, at certain intervals, and the comparison of that observation against an expected value. Monitoring systems then trigger notifications and alerts based on the observation of those metrics. For many teams, on-call is primarily about detection. Monitor everything and make sure we don’t miss out! In organizations with legacy monitoring configurations, getting better at Detection is tough. Environments are configured with broadly applied, arbitrarily set thresholds. Sometimes this is due to limitations in the monitoring…

The Dev and Ops Guide for Incident Management


A modern approach to creating an on-call process, defining team roles, and getting more sleep.

Ask members of a traditional DevOps team what it’s like to be on-call and you’ll likely hear a variety of answers such as, “it’s part of the job,” “it’s stressful,” or the very direct, “it sucks.” As innovative teams incorporate non-traditional Ops folks into the fold, like developers, they need to bring a modern approach with them. This modern incident management framework leans on automation and seeks to accelerate and streamline the often slow-developing on-call process, while also keeping everyone on the same…

Top Ten Practices of Highly Effective DevOps Incident Management Teams

I recently presented a webinar with DevOps.com about the behaviors we see in teams who represent the leading edge of Incident Management. Using the Incident Management Lifecycle as a jumping-off point, we explored 10 tips that nest into each of the 5 phases of an incident’s lifecycle. Depending on a team’s relative maturity, these ideas may represent anything from a starry-eyed daydream to an example of your normal operating practice. A recording of the presentation, polls, and Q&A can be viewed here. I’ll…

Microservices Monitoring and Critical Incident Management
How Dynatrace and VictorOps Work Together

Wolfgang Beer, Technical Product Manager at Dynatrace, co-wrote this article. Microservices can be game-changing if, as Martin Fowler says and Adam Drake explains, you have rapid provisioning, basic monitoring, and rapid deployment already in place. And when microservices meet containers, they can boost software engineering power to a whole new level. Together, they form architectures that act like living, breathing entities and are much more adaptable than in the past. But an ensemble of microservices is far more complex to understand, let alone troubleshoot, when…

Case Study: Skyscanner is Flying High with VictorOps and Monitis

When our friends at Monitis wanted to include us in a case study they wrote featuring Skyscanner, we said of course. In fact, we published this Skyscanner case study: Alerting Beyond Ops Metrics, last year. This write-up, written by Monitis, showcases how Skyscanner benefited from the VictorOps/Monitis integration to speed incident detection and response time.

Skyscanner: one of the world’s best travel sites

Skyscanner is a global metasearch engine that specializes in helping customers find comparisons for flights, hotels, and car rentals and a growing number…

Choosing a Chatbot:
From Hubot to Yetibot, What You Need to Know

If you haven’t picked up your copy of the O’Reilly book, ChatOps, by Jason Hand, then go get the free download for a comprehensive understanding of group chat. This excerpt on Chatbots gives great tips on the most common bots. From firing off an API call to resetting a server, chatbots are a way to trigger a set of automations using chat functionality. Let’s get started!

It’s time for a chatbot when third-party integrations are:
- Not available
- Not flexible enough to work with your unique…

This Latest and Greatest VictorOps Integration is Splunk-tastic

Splunk, a fan favorite platform for operational intelligence, transforms machine-generated data into valuable insights that can help make your business more productive, profitable, and secure. We are excited to announce the new and improved VictorOps For Splunk application, available in the Splunkbase. It allows you to add VictorOps as a custom “Alert Action” in Splunk, and provides you the flexibility to get the right information to all of your teams. Having the new VictorOps app available in the Splunkbase makes the installation a breeze. Simply download the app…