Over the last 20 years, businesses have been transformed into 24×7 revenue machines. Systems have slowly evolved from batch-processed engines that could recover from overnight downtime to real-time business processes where “an hour ago” is virtually irrelevant.

Although this sea change happened gradually, only in the last few years has it become clear how critical real-time reliability is.

On May 6, 2010, at 2:45 PM, an unprecedented and completely unexpected event shook the global stock exchange system to its core. In a matter of minutes, the Dow Jones Industrial Average plunged 9%, losing over $1T in value and sending ripples through world markets.


It took roughly five months to analyze a scant five minutes of trading data drawn from thousands of disparate systems. The analysis showed that the systems in place largely worked as designed and simply carried out the rules they had been given. What was different was the situation in which the systems ran. The crash is best described as a data-dependent processing outcome: correct logic, fed a combination of inputs that nobody had envisioned until the crash, produced a result that nobody intended.

Humans play an essential role in mission-critical IT systems, whether it’s the NYSE, the IT systems inside a Fortune 1000 company, or the website of a small e-commerce shop. It took months to trace the transactions leading up to the flash crash, yet in the moment it was automated logic that put a 5-second pause on trading, enabling the systems to recover. It was fortunate that system designers had foreseen the out-of-control trading scenario. With most systems, that is rarely the case.

If Moore’s Law is any guide, computing power has roughly doubled in just the last two years. System complexity has risen at a staggering rate as well. For teams to keep a hand on the throttle of the system, their tools need to be smarter than before. The days of the cowboy IT department are waning. This complexity, along with the advent of ways to write and deploy code faster and faster, has led to the discipline commonly referred to as DevOps.

Much confusion surrounds the evolving concept of DevOps. Simply put, DevOps is the recognition that systems have grown so complex that an “operator” has to know not only that all the IT dashboards are happy, but also that the systems behind them are actually functioning correctly and delivering the intended outcome. This is a tall order for all but the simplest of IT infrastructures.

In Part II of this post, I will talk about some historical examples of operations organizations that have leveled up their global operations in order to accomplish things that had never been accomplished before. Stay tuned.