Todd Vernon - October 14, 2014
In this post, a follow-up to an earlier one, I want to continue talking about leveling up operations.
We have discussed the importance of conducting a blameless post-mortem after incidents. I found an excellent example of this, and it happens to come from my first job, back in 1995. One of its key findings was the inexperience of newer members of the team, something every DevOps team has had to grapple with as systems have become more complicated and harder to debug and monitor.
I started my software career at NASA, where I worked testing experimental aircraft flight control software. I worked on two programs in my years at NASA, the X-29 and the X-31. Both were two-of-a-kind aircraft built specifically to test technologies thought too risky to commit to production.
Two aircraft are usually built in case one is lost during flight test, though actually losing an aircraft in flight test is rare. Both of these aircraft down-linked hundreds of parameters in real time to staffed control rooms where engineers could help the pilot interpret data - not unlike the monitoring in complex Internet businesses today. Engineers on the ground used digital displays and "dashboards" to look inside the flight control computers at things not shown to the pilot in the cockpit.
On January 19, 1995, shortly after I had left the test program, the first X-31 aircraft crashed on its last official test flight. The pilot successfully escaped by ejecting (18:01 in the video).
Crashing an $80m airplane that represents "half of the fleet" is, as one might expect, a big deal. In the months after the crash, NASA conducted an investigation into the cause (and the chain of events) that led up to the crash. NASA also put together this video to capture the elements that led up to the crash. It was, in fact, a form of post-mortem documentary to accompany the accident report.
This video is an excellent example of a blameless post-mortem analysis. I encourage everyone who operates complex systems to take note of the findings, as they are completely applicable to any complex operations scenario. If you don't know your systems, then you can't know what might go wrong.