Alex Papadimoulis is an expert in CI/CD, release pipelines and resilient software. Alex is the President and CEO of Inedo and is passionate about DevOps and methods for delivering reliable services faster. Below, he has graciously contributed his thoughts about configuration drift here on the VictorOps blog.
Keeping systems constantly up while maintaining total system security is vital to modern businesses. But speed and consistency sometimes clash. Do you fix a problem fast, or do you fix it right? Fast fixes usually win, and they often cause further systemic problems. And because system health is so important, every issue becomes a red alert.
This leads to alert fatigue, or the “exposure to a high volume of frequent alerts, causing desensitization to critical issues.” Alert fatigue hurts both teams and businesses. Not only does this desensitization mean truly serious issues get lost in a sea of high-priority alerts, but it also takes a physiological and psychological toll on incident response engineers (as this article, and this one, explain). This cycle of over-alerting and fatigue is unsustainable.
To keep employees and businesses safe, incident response teams need to break the cycle of alert fatigue by eliminating one of its root causes – configuration drift.
Even if your team is diligent about testing and carefully stages deployments before releasing to production, configuration drift can, and does, still happen.
Configuration drift is the culprit behind the splitting headache you get when changes create unanticipated, undocumented differences between your staging and production environments, leading to breaks upon release. Staging and production are meant to be identical, but drift occurs for many different reasons in real-world shops, as this previous Inedo Blog post explained:
Critical package updates are made at breakneck speed to address a security vulnerability or incident, and they often skip procedure in favor of speed.
When testing servers, a developer may make a manual configuration change to better document or track a bug. This can help define the issue, but if the change isn’t reverted, it will cause drift.
Adding more resources to bolster server configuration can help systems cope with peak load times, but such additions are often unplanned or undocumented, eventually leading to configuration drift.
Simply put, configuration drift occurs whenever someone makes a change to the production environment without recording that change and without ensuring complete parity between staging and production. And although it’s unintentional, it can end in unanticipated bugs and the resulting flurry of pleas for rapid incident response.
In other words, configuration drift is a normal part of DevOps that hurts your team and the business.
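To make the idea concrete, here is a minimal sketch of drift detection as a plain comparison between two environments. Everything in it is an assumption made for illustration: the `find_drift` helper, the configuration keys and the values are hypothetical, and real tools track far more than a flat dictionary of settings.

```python
from typing import Any

# Placeholder shown when a setting exists in only one environment (hypothetical).
ABSENT = "<absent>"


def find_drift(staging: dict[str, Any], production: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return each setting whose value differs, as (staging_value, production_value)."""
    drift = {}
    for key in sorted(staging.keys() | production.keys()):
        staging_value = staging.get(key, ABSENT)
        production_value = production.get(key, ABSENT)
        if staging_value != production_value:
            drift[key] = (staging_value, production_value)
    return drift


# Illustrative data only: an emergency patch, an unplanned capacity bump
# and a leftover debugging switch that was never reverted.
staging = {"openssl": "3.0.13", "max_connections": 200, "log_level": "info"}
production = {
    "openssl": "3.0.14",
    "max_connections": 400,
    "log_level": "info",
    "debug_trace": True,
}

for setting, (in_staging, in_production) in find_drift(staging, production).items():
    print(f"{setting}: staging={in_staging!r} production={in_production!r}")
```

Each reported setting is a change that reached one environment without being recorded in the other, which is exactly the kind of gap that surfaces later as a surprise at release time.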
Drift can cost your team huge amounts of time and cause significant stress, and it can cost your company money and its competitive edge. Consumers today expect total uptime and they’re happiest when their expectations are met.
But beyond hurting customers and the business, configuration drift hits hardest the incident response engineers who are on call to handle incidents as fast as possible and maintain 99.9% uptime.
When incidents occur, they must respond immediately, which may lead to cutting corners. When corners are cut, drift happens, and when drift happens, incidents occur. It’s a vicious circle that leaves incident response engineers trapped.
The cycle perpetuates even more aggressively when information is allowed to silo. Unless information is centralized, incident response engineers won’t necessarily know the importance of each incident and may not have visibility into what caused the issue they’re trying to fix. They end up stressed out and wasting time and resources while playing detective. This gradual, incremental erosion of standard, actionable alerting processes leads to alert fatigue and continues to slow down remediation and response times.
Businesses can be doing much more to stop this erosion and free their incident response engineers to focus on the highest-importance issues that arise.
VictorOps recommends that incident investigations focus on people and processes, not only technology. This is excellent advice. But what do you do in the meantime to prevent the negative consequences of configuration drift and alert fatigue?
Unsurprisingly, documentation is your best friend in DevOps. The fuller the picture you have, the better you can act and respond in any given case. There are many server monitoring tools on the market today, but almost none of them extend to monitoring for, and even preventing, configuration drift.
In fact, the only one that does is Inedo’s Otter.
According to VictorOps, you should document everything centrally and manage the CI/CD pipeline and on-call incident management process in a single tool. Inedo shares this philosophy. All of Inedo’s tools have a centralized dashboard and native documentation that increase transparency across teams.
The Otter dashboard shows your server configurations as either healthy or drifted, and Otter can be programmed either to notify teams that drift has occurred or to auto-remediate, rejecting unplanned changes and bringing environments back to a set configuration. Inedo’s tools offer an API that allows you to integrate with any tools you’re already using, and of course, Otter integrates flawlessly with Inedo’s CI/CD tool, BuildMaster.
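The difference between notifying and auto-remediating can be sketched in a few lines. To be clear, this is not Otter’s API or how Otter is implemented; `DriftPolicy`, `reconcile` and the sample settings are all hypothetical names invented for this illustration of the two behaviors described above.

```python
from enum import Enum


class DriftPolicy(Enum):
    NOTIFY = "notify"        # report drift and leave the server as it is
    REMEDIATE = "remediate"  # report drift and restore the desired configuration


def reconcile(desired: dict, actual: dict, policy: DriftPolicy) -> tuple[str, dict, list[str]]:
    """Return (status, resulting_config, messages) for one server.

    A setting missing from either side is treated as None in this sketch.
    """
    drifted_keys = sorted(
        key for key in desired.keys() | actual.keys()
        if desired.get(key) != actual.get(key)
    )
    if not drifted_keys:
        return "healthy", dict(actual), []

    messages = [
        f"drift in {key!r}: expected {desired.get(key)!r}, found {actual.get(key)!r}"
        for key in drifted_keys
    ]
    if policy is DriftPolicy.REMEDIATE:
        # Reject the unplanned changes by returning to the desired configuration.
        return "healthy", dict(desired), messages
    return "drifted", dict(actual), messages


# Illustrative data only.
desired = {"max_connections": 200, "log_level": "info"}
actual = {"max_connections": 400, "log_level": "info", "debug_trace": True}

status, config, messages = reconcile(desired, actual, DriftPolicy.NOTIFY)
print(status, messages)

status, config, messages = reconcile(desired, actual, DriftPolicy.REMEDIATE)
print(status, config)
```

Either way the drift is recorded in `messages`, so the team keeps a trail of what changed even when the change is rolled back automatically.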
Configuration drift is a pain common to most shops, but it can absolutely be prevented. Connecting people with the right information and the right tools can maintain environment parity between staging and production, reducing the likelihood of drift and helping teams fix it quickly when it does occur.
On-call incident response engineers regularly experience huge amounts of stress and pressure to deliver quick fixes that please customers and protect the business. Eliminating the pain of drifted configuration and the bugs it lets through can reduce the technical debt incident response engineers incur when reverse-engineering undocumented configuration changes. By removing distractions, an engineer’s time, attention and talents can focus on more important incidents and fixing them quickly and reliably.
To learn more about how Inedo can remediate drift and help streamline your CI/CD processes, contact Mike Goulis at email@example.com.