Todd Vernon - November 10, 2015
Lately I have been reading the excellent book Digital Apollo. It explores the evolution of digital control systems and the man-machine interface that evolved during the development of space flight and ultimately the Apollo missions. It’s a fantastic book - more technical than most - but very approachable to those not familiar with flight control, embedded software, or the challenges of building such systems. As I read the book, I could not help but compare the way space missions were executed with the role of DevOps in modern SaaS businesses.
I started my career at NASA testing digital flight controls for an experimental aircraft, the X-29. The X-29 flight test program was just the latest in the series of one-off aircraft that started with the Bell X-1 and moved to the X-15, which laid the groundwork for Apollo. As a result, flight test was executed in a very similar fashion in all these programs. Nearly every switch, surface, actuator, and probe was instrumented, and that data was downlinked in real time to a control room as the airplane or spacecraft flew.
As the vehicles became more fly-by-wire and had digital computers at their core, those computers also downlinked many of their internal state variables to the ground, where teams of engineers could keep track of every button push, flight mode, and acceleration in real time, helping the pilot watch for anything that could end the mission or his life.
The worlds of space mission operations and SaaS businesses are converging. While generally no one dies if your SaaS service fails to operate, the implications of downtime get more real every year. If you operate a platform that serves customers who collectively pay millions of dollars a day for your product or service, that is serious business.
Just as Apollo downlinked its state variables, we now watch the equivalent using tools like New Relic as our systems support millions or billions of customer transactions through the services we have built. While Apollo’s AGC had to work for several hundred hours at a time, our SaaS services get turned on once when we launch our company and the mission goes on forever. As a result, we are replacing rooms of engineers who stayed for days with systems that connect engineers to the technology all the time.
Modern monitoring tools are starting to approach the quality of observation we had back at NASA in terms of immediacy of data, and they now far surpass those relatively crude tools in the spontaneity of exploration and discovery. Today, I get an alert on my iPhone when some part of our system is acting inconsistently, and I can interact with our engineers in real time regardless of location.
Like the rooms of engineers that supported an Apollo mission, today we are on the verge of supporting our complex systems with a virtual room of engineers using tools like VictorOps. As systems become more complex, it becomes more likely that a problem needs to be solved by the person who wrote the code in the first place. Very often only that person has (or ever had) intimate enough knowledge of how the system works to fix it or work around it and keep the mission (business) functioning.
On Apollo 14, while the spacecraft was in lunar orbit, an engineer noticed that a software bit buried deep in the guidance and navigation computer inside the Lunar Module (LM) indicated that a descent program abort had been initiated. This was caused by a loose bead of solder that effectively kept “pushing” the abort button. It was not a problem, or even noticed by the crew, because the descent program that would land the astronauts on the Moon was not running yet.
Had that program been initiated, as was scheduled only minutes later, the mission would have been aborted and quite likely the crew would have been lost. What made that engineer’s knowledge of how the system would misbehave useful was being connected to the software through advanced monitoring, along with the ability to act on that data in real time. Remove any part of that equation and Apollo 14 would have turned out very differently.
This is the basic DNA of how we look at our product at VictorOps. We connect engineers to the mission-critical machines that run your business. If you expect the unexpected and outfit your teams accordingly, you can be ready to respond to any problem faster and more accurately than your competition.