Essential Stats for Your On-Call Life

Amanda Boughey - January 02, 2018

Curious about the state of on-call, but don’t have enough time to read an eBook? Don’t worry. We’ve gathered the important stats in one place for your skimming pleasure.

Here’s What We Learned.

DevOps is still new, and so is mature incident management. In fact, almost 75% of respondents have been practicing DevOps for less than 2 years, and over 40% of respondents identify as novices or beginners when it comes to incident management maturity.

The more you practice DevOps, the better your on-call experience. 40% of those not practicing DevOps report a bad on-call experience. On the flip side, 40% of those who have practiced DevOps for 5 or more years report a positive on-call experience.

But who is on-call? Operations teams still bear the brunt of on-call duties, with developers quickly gaining on them. By functional area, here’s who is most likely to be on-call:

  1. Operations
  2. Development
  3. DevOps
  4. IT
  5. Support

Of those team members on-call, almost 70% feel neutral, somewhat good, or extremely good about being on-call.

What are the top reported problems about being on-call?

  1. Alert noise
  2. Lack of remediation information
  3. Inefficient communication
  4. Inaccurate or difficult reporting

Tools for On-Call

Your on-call experience won’t be successful without the right tools. But what tools seem to help the most? People use tools for monitoring, automation, communication, and more.

What are the most frequently used monitoring tools?

  • New Relic
  • AppDynamics
  • Monitis
  • SolarWinds
  • PRTG
  • Logstash
  • Splunk
  • Sumo Logic
  • Amazon CloudWatch
  • Pingdom
  • Catchpoint
  • Datadog
  • Sensu
  • Zabbix
  • Nagios

How about the most common automation tools?

  • Puppet
  • Chef
  • Ansible
  • SaltStack

During a firefight, some tools rank as more important than others. In order of importance:

  1. Logs
  2. Chat platform
  3. Monitoring tool dashboard
  4. Graphite or graph tools
  5. Runbooks
  6. Email

How widely used are these tools?

  • 80% currently use automation tools
  • 80% currently use ChatOps
  • 70% are currently moving to microservices and containers

Don’t Take Alert Fatigue Lightly

All these tools mean nothing if you run into alert fatigue. And according to our results, over 60% say alert fatigue is a problem. So it makes sense that the same percentage is working to proactively reduce alert fatigue. 30% take time on a weekly basis to try to reduce it.

What steps are people taking to remediate alert fatigue?

  • Making alerts contextual
  • Making alerts actionable
  • Reducing redundancy
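
To make those steps concrete, here is a minimal Python sketch of what reducing redundancy and checking for context might look like. It is not taken from the survey or from any particular alerting tool: the Alert fields, the five-minute window, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    # Hypothetical alert shape; real tools will have their own fields.
    service: str
    check: str
    timestamp: float                   # seconds since epoch
    runbook_url: Optional[str] = None  # context that makes the alert actionable


def dedupe_alerts(alerts: List[Alert], window_seconds: float = 300) -> List[Alert]:
    """Drop repeats of the same (service, check) pair within the window.

    The window is measured from the last alert that was kept for that pair.
    """
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is not None and alert.timestamp - previous < window_seconds:
            continue  # redundant: same problem already reported recently
        last_kept[key] = alert.timestamp
        kept.append(alert)
    return kept


def missing_context(alerts: List[Alert]) -> List[Alert]:
    """Return alerts that carry no runbook link, i.e. little remediation context."""
    return [a for a in alerts if not a.runbook_url]
```

Running something like dedupe_alerts over a batch of incoming alerts before paging, and reviewing the output of missing_context each week, is one way a team could act on the list above. The right window and the right definition of "context" will depend on your own systems.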

When It All Comes Together

An important part of being on-call is proactively making sure it gets better each time. This can be done with proper notes and runbooks, as well as holding blameless post-mortems. Lucky for us, only 7% think the purpose of post-mortems is to assign blame. We couldn’t agree more, which is why we’ve transitioned from post-mortems to post-incident reviews.

What are the benefits of post-incident reviews?

  • We gain a feeling of empathy across different departments
  • We uncover bottlenecks and areas of friction in our processes
  • We’re able to update remediation information more quickly

It’s not surprising that post-incident reviews are a high priority. When something breaks, a lot is at stake.

What’s the most troubling when it comes to downtime?

  • Loss of revenue or decreased stock price
  • Diminished perception of brand and negative publicity
  • Customer defection to competitors or competitor retaliation

If you want to dive into more information about the state of on-call, make sure you download the eBook to see what we didn’t cover here.

*All stats provided from the 2016/17 State of On-Call