In today’s world, a company must be a Learning Organization to be successful and innovative. Learning from both failure and success in order to make small, incremental improvements is critical. But until you apply new information, you haven’t truly learned anything, and you certainly haven’t improved.
According to the 2015 Monitoring Survey, most companies leverage metrics from monitoring and logging purely for performance analytics and trending. If high availability and reliability are important, they also use those metrics for fault and anomaly detection and alerting. Despite these best practices, the metrics serve mainly as context for keeping things running or returning them to normal when there’s a problem. Rarely is that data used to identify areas for improvement once services have been restored.
When an outage hits your system, you will absolutely repair and restore services as best you know how, but are you paying attention to the data from the recovery efforts? What were operators seeing during diagnosis and remediation? What actions did they take? What was everyone doing and saying to one another? Do you have a documented step-by-step replay of exactly what took place during that outage?
This old-view perspective on the purpose of monitoring, logging, and alerting leaves the full value of metrics unrealized. It fails to address what’s important to the overall business objective and it lacks any hope of seeking out innovation or disruption of the status quo.
In the webinar I’m doing with DevOps.com on 6/9, you’ll hear more detail about these key takeaways:
— The distinction of MTTR and its importance over MTBF
— What it means to be a Learning Organization and how your company can achieve that status
— How to identify if you are making the best use of your IT Operations metrics
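
For readers who want a concrete feel for the two metrics in the first takeaway, here is a minimal Python sketch of how MTTR and MTBF might be computed from incident records. The incident timestamps, the 30-day window, and the `reliability_summary` function are all made up for illustration; they are not from the survey or the webinar. It assumes MTTR is total downtime divided by the number of incidents, MTBF is total uptime divided by the number of incidents, and availability is MTBF / (MTBF + MTTR).

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (outage start, service restored).
incidents = [
    (datetime(2015, 1, 10, 9, 0), datetime(2015, 1, 10, 9, 30)),
    (datetime(2015, 1, 20, 14, 0), datetime(2015, 1, 20, 16, 0)),
    (datetime(2015, 1, 28, 23, 0), datetime(2015, 1, 28, 23, 30)),
]

# Hypothetical observation window of 30 days.
window_start = datetime(2015, 1, 1)
window_end = datetime(2015, 1, 31)


def reliability_summary(incidents, window_start, window_end):
    """Return (MTTR, MTBF, availability) for the given window.

    Assumes incidents do not overlap and all fall inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTTR/MTBF")

    # Total time spent in an outage, summed across incidents.
    downtime = sum((end - start for start, end in incidents), timedelta())
    # Everything else in the window counts as uptime.
    uptime = (window_end - window_start) - downtime

    count = len(incidents)
    mttr = downtime / count  # mean time to repair
    mtbf = uptime / count    # mean time between failures
    availability = mtbf / (mtbf + mttr)
    return mttr, mtbf, availability


mttr, mtbf, availability = reliability_summary(incidents, window_start, window_end)
print(f"MTTR: {mttr}, MTBF: {mtbf}, availability: {availability:.4%}")
```

With this made-up data, MTTR works out to 1 hour and MTBF to 239 hours. The same records that feed these averages (when the outage started, when service was restored) are the beginning of the step-by-step replay described above.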