Mike Meredith - March 11, 2016
Managing technical infrastructure is as challenging as it has ever been. To keep up with the demands of the modern marketplace, Operations teams have to satisfy myriad requirements. A common framework for talking about them is the “Trust Services Principles”, which are summarized as:
Security: Is the system protected from unauthorized physical and logical access?
Availability: Is the system continuously available for its intended purpose?
Processing Integrity: Is data being processed in a complete, accurate and timely fashion?
Confidentiality: Is data that is designated as confidential adequately protected?
Privacy: Is Personally Identifiable Information properly collected, retained and destroyed in a way that protects commitments to user privacy?
VictorOps takes all of these principles seriously, but today I want to talk about the first two, Security and Availability. Specifically, I want to explain how our decision to co-locate our infrastructure at the Fortrust datacenter has been a great benefit to us and to our customers.
When Todd, Bryce and Dan were founding VictorOps in late 2012, they knew going in that our customers would be relying on us in some of their most crucial moments, and we needed to ensure that we could meet the goals in the Trust Services Principles without excuses. This meant that, in a time when many SaaS companies are starting in the cloud and staying there, VictorOps needed to take a different approach. It meant managing our own servers and locating them at a world-class facility.
Fortrust has been operating in Denver since 2000. If you work in technology in the Colorado Front Range area, you’ve probably heard of Fortrust and their stellar reputation for uptime and security. The Fortrust team clearly set out to build a no-compromises data facility, and their Uptime Institute Tier III certification proves the success of that vision.
We toured a lot of datacenters when VictorOps was getting started, but once we saw the Fortrust facility, we knew our search was over. The strength of the facility was obvious, but what came across even more was the professionalism of the Fortrust crew, and the pride that everyone we met took in the job that they do. They got the vision behind VictorOps right away, and have been both terrific partners and boosters for us during our whole history. (Not to mention acting as a host for DevOpsDays Rockies!)
Physical security is one of those things that are absolutely critical, and an absolute drag at the same time. Fortrust takes that part of the equation off our plate. With a 24x7 on-site security staff and massive physical security infrastructure, we know that absolutely no one will have access to our servers without our say-so. The same goes for fire protection, flood prevention, power redundancy, and the dozens of other details that make a datacenter bullet-proof. We know it’s covered. As I write this, Fortrust has enjoyed over 14 years of continuous uptime.
With so many companies taking the approach of “Light up two AWS zones and call it good”, isn’t that how server availability is done now? Not for us. Our philosophy is that the key to uptime is control and knowledge. That means that our platform runs on servers we own and manage, and communicates over a network that we designed and built ourselves.
We do it this way for several reasons. There’s a lot you can do with your environment to make it highly available, and we do it all. Resilience is key here: redundant power feeds and supplies, a fully meshed network with each server bonded to multiple switches, and on and on. Any physical component in our environment can fail without causing an outage. We know, because we test continuously! We don’t have to take a cloud provider’s word for it.
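To make the "any component can fail" idea concrete, here is a minimal sketch of the kind of check you could run against a network map. It is not our actual tooling. The topology, the node names, and the `single_points_of_failure` helper are all hypothetical, and the sketch only models whole-node failures on a simple graph.

```python
from collections import deque

# Hypothetical topology: every server is bonded to two switches,
# and every switch has a link to both core routers.
LINKS = [
    ("server-1", "switch-a"), ("server-1", "switch-b"),
    ("server-2", "switch-a"), ("server-2", "switch-b"),
    ("switch-a", "core-1"), ("switch-a", "core-2"),
    ("switch-b", "core-1"), ("switch-b", "core-2"),
]
SERVERS = {"server-1", "server-2"}
CORES = {"core-1", "core-2"}


def reachable(links, start, failed):
    """Return the nodes reachable from start when the failed node is down."""
    neighbors = {}
    for a, b in links:
        if failed in (a, b):
            continue  # a link attached to the failed node is unusable
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def single_points_of_failure(links, servers, cores):
    """List (component, server) pairs where losing the component
    cuts the server off from every core router."""
    components = {node for link in links for node in link} - servers
    spofs = []
    for failed in sorted(components):
        for server in sorted(servers):
            if not (reachable(links, server, failed) & cores):
                spofs.append((failed, server))
    return spofs


if __name__ == "__main__":
    print(single_points_of_failure(LINKS, SERVERS, CORES))
```

An empty result means no single switch or router failure isolates a server in this toy model. A check like this only validates the design on paper, which is why it complements, rather than replaces, testing real failures.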
When maintenance needs to happen, we control when it happens, and what procedure gets followed. Patches to infrastructure can happen immediately if we need them right away, or can be scheduled for when we can manage the changes without downtime. Very few cloud vendors, if any, will give you this level of control. If we need to make a change to support a new feature or integration, we don’t need to wait for someone who doesn’t work for us and doesn’t share our concern for our customers. We make the change ourselves.
This doesn’t mean that we’re limited to a single site, of course. We maintain a warm-standby DR site in the cloud that can handle our production traffic if there’s a disaster at our primary site or we need to perform large-scale maintenance. Maintenance failovers are seamless, with no alerts being dropped (we test continuously). As we continue to grow and expand, we expect to move to an active-active multi-site architecture to manage the scale. But even then, we will continue to work with highly redundant network and server designs, and maximize uptime at each site.
There are challenges to this approach, especially for a startup. Your organization needs to have a tremendous depth of knowledge about hardware, network design and management, operating systems management and other “old school” infrastructure concepts. You need to have a team with the knowledge, temperament and level of commitment necessary to maintain best practices. And you need to have great IP, colocation and hardware partners.
The payoff is huge, though. Since we designed and built our network ourselves, we understand what it takes to port our platform to cloud providers or other datacenters without relying on proprietary features in the cloud stack. We know our platform will run anywhere, not just on AWS (we’ve tested!).
For many organizations, especially those without deep infrastructure knowledge on the bench, a cloud-only strategy can make a lot of sense. You can scale quickly, you can do more with a smaller and less-experienced staff, and you can use proprietary features like load balancers and cloud-based data stores to save yourself some engineering effort.
But for real, enterprise-grade availability, there’s no substitute for being able to say “We built it, we own it, and we manage it.” At VictorOps we can say that, and we don’t take a back seat to anyone when it comes to uptime.