Incident Response for Hybrid Clouds

Michael Churchman June 07, 2019


For most enterprises (and many smaller companies) that have moved or expanded into the cloud, the hybrid cloud is the cloud.

It’s not hard to see why. Enterprises are typically unwilling to move all of their operations to the public cloud, whether for security or for other reasons. It makes more sense to keep high-security operations on-premises (along with those that are tied to local hardware for technical or legacy reasons), even as they move low-security operations and public-facing services to the public cloud.

But hybrid cloud architectures come with challenges. Because they blend paradigms from both the public cloud and on-premises infrastructure, they require IT teams to think in nuanced ways, with both sides of their brains.

To put this challenge into context, let’s take a look at how a hybrid cloud changes the way IT teams work. We’ll start with an overview of what makes hybrid cloud architectures unique, then discuss what this means in the context of incident response. We’ll focus on responding to security events in particular, but the lessons apply generally to all types of incident response.

Limits to control

On the private side of a hybrid cloud, the IT and development teams together can closely control and monitor both application software and the overall cloud and hardware infrastructure. They can manage security at all levels, with as much granularity as they choose (given the capabilities and limits of their security tools). If they were managing a strictly private cloud (for on-premises use only, with no public cloud interface), they would at least theoretically have full control over its security.

On the public side, however, in-house IT and development teams are likely to have little or no control over many of the factors that typically have a major impact on security. Instead, they have to rely on the cloud service provider’s security resources and IT staff, which in many ways places them in the role of end users, with limited autonomy and a limited capacity to act.

Greater attack surface

The public cloud presents a much larger and more complex attack surface than is typically the case with a private cloud. The cloud itself is generally larger and more complex, and because it’s public, it’s multi-tenant by nature. Your operations share both space and resources with an unknown (but generally large) number of other users whose activities and intentions you’re likely to know nothing about.

Potential attackers are also likely to have access to detailed information about the public cloud environment, including APIs for the resources which you’re using (like the VictorOps API), authentication protocols, default security settings, and likely points of vulnerability. The public side of a hybrid cloud is, above all, public.

Through the front door

What does this mean in practical terms?

The first and foremost thing to understand is that any system is only as secure as its least secure component. If you keep your back door locked but your front door is unlocked, your house isn’t secure. If the private side of your hybrid cloud has minimal attack surfaces, but you have inadequate control over (or even knowledge of) the attack surfaces on the public side, then neither side is really secure.

If attackers gain entry to the private side of your hybrid cloud by means of vulnerabilities on the public side, they’ll have effectively circumvented the outer (and much of the inner) perimeter of your private-cloud security system, resulting in a potentially much more severe breach than if they had attempted a direct attack on your private cloud.

When this happens, you may also find that your ability to trace and analyze the attack is limited. The security and monitoring tools that you’re using in your public cloud operations may allow you to trace the intrusion from its initial point of entry into the virtualized environment in which your cloud-based applications run. But these tools may tell you little or nothing about how the intruders got there. For that, you may need access to the cloud service provider’s logs and monitoring data, and for a variety of reasons, the provider may be reluctant (or simply slow) to give you that information.
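One practical step is to assemble a single timeline from your own monitoring data and whatever logs the provider eventually supplies, so that the gap is visible. The Python sketch below assumes both sources have already been parsed into time-sorted events; the `LogEvent` type, the `"tenant"`/`"provider"` source labels, and the helper names are hypothetical and not part of any provider’s API.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime
    source: str   # hypothetical labels: "tenant" (your tools) or "provider" (CSP logs)
    message: str


def build_timeline(*streams):
    """Merge event streams, each already sorted by time, into one chronological list."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def gap_before_entry(timeline):
    """True if no provider event precedes the first tenant-side event, meaning the
    timeline cannot yet show how the intruder reached your environment."""
    first_tenant = next((e for e in timeline if e.source == "tenant"), None)
    if first_tenant is None:
        return False
    return not any(
        e.source == "provider" and e.timestamp < first_tenant.timestamp
        for e in timeline
    )


if __name__ == "__main__":
    tenant = [LogEvent(datetime(2024, 1, 1, 10, 5), "tenant", "unexpected login on app VM")]
    provider = [LogEvent(datetime(2024, 1, 1, 10, 1), "provider", "API key used from new IP")]
    print(gap_before_entry(build_timeline(tenant)))            # True
    print(gap_before_entry(build_timeline(tenant, provider)))  # False
```

While `gap_before_entry` returns `True`, your timeline starts at the point where the intruder appeared in your environment, and the missing earlier events are exactly what you would need to request from the provider.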


Different clouds, different tools

Tools and infrastructure also have a major impact on incident detection and response. Differences between the public and private side in monitoring, log analysis and alerting tools can affect the way an incident is categorized, the severity level assigned to it, and ultimately, the nature of the response.
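A first step toward consistent categorization is to translate each tool’s severity scale onto one shared scale before alerts are compared or routed. This is a minimal Python sketch; the tool names, their scales, and the four-level `Severity` enum are illustrative assumptions, not the conventions of any particular product.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Shared scale used by the response team (an illustrative assumption)."""
    INFO = 0
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


# Hypothetical tools and scales: one uses words, the other uses 1 (worst) to 5 (mildest).
SEVERITY_MAPS = {
    "public_cloud_monitor": {
        "low": Severity.INFO,
        "medium": Severity.WARNING,
        "high": Severity.ERROR,
        "critical": Severity.CRITICAL,
    },
    "onprem_siem": {
        1: Severity.CRITICAL,
        2: Severity.ERROR,
        3: Severity.WARNING,
        4: Severity.INFO,
        5: Severity.INFO,
    },
}


def normalize(tool, raw_severity):
    """Translate a tool-specific severity into the shared Severity scale."""
    try:
        return SEVERITY_MAPS[tool][raw_severity]
    except KeyError:
        # Fail loudly instead of guessing a level for an unknown tool or value.
        raise ValueError(f"no severity mapping for {tool!r} / {raw_severity!r}") from None
```

Raising an error on an unmapped value is a deliberate choice: silently defaulting to a low severity is one way an alert from an unfamiliar tool ends up under-ranked.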

These differences can be amplified by differences in cloud architecture and infrastructure. A possible intrusion that presents a relatively low-level threat in one infrastructure (for example, an attempt to crack a password for a virtualized application in the public cloud) may represent a much higher-level threat in another (if, for example, that virtualized application provides access to sensitive data in the private cloud).
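The same event can be scored differently depending on what the affected asset can reach. The Python sketch below encodes one hypothetical policy: any public-side asset with a path to sensitive private-cloud data is never rated below `high`. The asset fields and the choice of `high` as the floor are assumptions made for illustration.

```python
from dataclasses import dataclass

SEVERITY_LEVELS = ["low", "medium", "high", "critical"]


@dataclass(frozen=True)
class Asset:
    name: str
    location: str               # "public" or "private"
    reaches_private_data: bool  # can this asset reach sensitive private-cloud data?


def assess(base_severity, asset, floor="high"):
    """Apply a hypothetical policy: a public-side asset that bridges into sensitive
    private data is never rated below `floor`."""
    level = SEVERITY_LEVELS.index(base_severity)
    if asset.location == "public" and asset.reaches_private_data:
        level = max(level, SEVERITY_LEVELS.index(floor))
    return SEVERITY_LEVELS[level]


if __name__ == "__main__":
    isolated_app = Asset("public-web", "public", reaches_private_data=False)
    bridging_app = Asset("public-api", "public", reaches_private_data=True)
    print(assess("low", isolated_app))  # low
    print(assess("low", bridging_app))  # high
```

Under this policy, a password-cracking attempt against an isolated public web app stays `low`, while the same attempt against an app that bridges into private data is raised to `high`.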

Incident response as a DevOps practice

The security demands of the hybrid cloud require an incident response strategy that goes beyond both the traditional, bare-bones reactive response and the more sophisticated collaborative and role-based tactical response.

In effect, your response strategy and your response teams must both embody the basic principles of DevOps. This means, among other things, full participation by both IT staff and developers, along with a commitment to communicate, to collaborate, and to approach each incident as an opportunity for learning and remediation.

What do you need to make this work? In many ways, it comes down to three key elements: identification, planning and management.

Identification

Identify points of access from the public cloud to the private cloud, and map them out. Identify potential vulnerabilities in the public cloud, and map the way in which an intruder might use them to gain access to the private cloud. Much of this information should be available from standard security tools; your goal, in this case, will be to use this knowledge to guide your response.
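Mapping access points amounts to treating the hybrid environment as a directed graph and enumerating the routes from public entry points to private assets. The Python sketch below uses a small, made-up topology (every node name is hypothetical) and a breadth-first search over partial paths, then reports the nodes that every route passes through, which are natural places to concentrate monitoring.

```python
from collections import deque

# Hypothetical topology: an edge A -> B means "A can open a connection to B".
TOPOLOGY = {
    "internet": ["public-lb", "public-storage"],
    "public-lb": ["public-web"],
    "public-web": ["public-api"],
    "public-api": ["vpn-gateway"],
    "public-storage": ["vpn-gateway"],
    "vpn-gateway": ["private-db", "private-hr"],
    "private-db": [],
    "private-hr": [],
}
PRIVATE_ASSETS = {"private-db", "private-hr"}


def attack_paths(graph, start, targets):
    """Breadth-first enumeration of every cycle-free path from start to a target."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets:
            paths.append(path)
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in path:  # skip nodes already on this path (no cycles)
                queue.append(path + [neighbor])
    return paths


def chokepoints(paths):
    """Intermediate nodes shared by every path: candidates for focused monitoring."""
    if not paths:
        return set()
    common = set(paths[0][1:-1])
    for path in paths[1:]:
        common &= set(path[1:-1])
    return common


if __name__ == "__main__":
    paths = attack_paths(TOPOLOGY, "internet", PRIVATE_ASSETS)
    for p in paths:
        print(" -> ".join(p))
    print(chokepoints(paths))  # {'vpn-gateway'}
```

Exhaustive path enumeration grows quickly on large graphs, so in a real environment you would likely feed this from an inventory or security tool and cap the path length; the sketch only shows the shape of the analysis.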

Planning

Planning isn’t just a matter of organizing response teams and setting up on-call schedules. You need to know what kind of information is likely to be available from the cloud service provider (and what will not be available), as well as who to contact, their probable response time, and possible limits to availability. It’s also important to have a contingency plan in case you simply can’t get a response from the cloud service provider, and are left defending your private cloud against an intrusion from the public side with only limited access to public cloud security resources.
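Part of the plan can be written down as data: who to contact at the provider, how long each contact typically takes to respond, and what you will do if nobody answers. The following Python sketch assumes sequential escalation, where each contact gets its expected response time before you move on to the next; the channel names, timings, and contingency action are hypothetical placeholders for your own plan.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProviderContact:
    channel: str                # hypothetical, e.g. "support-portal"
    expected_response_min: int  # how long to wait before escalating further


@dataclass(frozen=True)
class ResponsePlan:
    contacts: Tuple[ProviderContact, ...]  # tried in order
    contingency: str                       # action if the provider never responds


def next_step(plan, minutes_elapsed, provider_responded):
    """Decide what the team should be doing at this point in the incident."""
    if provider_responded:
        return "coordinate with provider"
    deadline = 0
    for contact in plan.contacts:
        deadline += contact.expected_response_min  # cumulative escalation deadline
        if minutes_elapsed < deadline:
            return f"waiting on {contact.channel}"
    return plan.contingency


if __name__ == "__main__":
    plan = ResponsePlan(
        contacts=(ProviderContact("support-portal", 30),
                  ProviderContact("account-manager", 60)),
        contingency="isolate public-to-private links",
    )
    for t in (10, 45, 120):
        print(t, next_step(plan, t, provider_responded=False))
```

Encoding the contingency explicitly means the decision to act without the provider is made in advance rather than improvised in the middle of an incident.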

Management

Management also means much more than basic organization. For effective incident response in the hybrid cloud, there is no substitute for a full-featured incident management system that was designed from the ground up to embody DevOps principles and to integrate with both development and monitoring tools.

In many ways, the keys to automated incident management are rule-based alerting, noise filtration, intelligent routing of alerts, and perhaps more than anything, fully contextualized alert information. You don’t need a beeper. Nobody needs a beeper at the end of the second decade of the 21st century. What your incident response team needs is accurate, context-based information at the moment that they receive an alert.
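Rule-based alerting, noise filtration, routing, and context can be sketched as one small pipeline. This is a generic Python illustration, not the VictorOps API or its actual behavior; the alert fields, team names, the 300-second deduplication window, and the context store are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    key: str          # deduplication key, e.g. "auth-failures:public-api"
    source: str       # "public" or "private"
    severity: str     # e.g. "warning" or "critical"
    timestamp: float  # seconds since the epoch
    context: dict = field(default_factory=dict)


# Hypothetical routing rules, evaluated in order; the first match wins.
ROUTES = [
    (lambda a: a.severity == "critical", "security-oncall"),
    (lambda a: a.source == "public", "cloud-team"),
    (lambda a: True, "infra-team"),
]


class AlertPipeline:
    def __init__(self, routes, dedup_window_s=300.0, context_store=None):
        self.routes = routes
        self.dedup_window_s = dedup_window_s
        self.context_store = context_store or {}  # key -> extra context (runbook, owner...)
        self._last_delivered = {}                 # key -> timestamp of last delivered alert

    def process(self, alert):
        """Return (team, alert) if the alert should page someone, else None."""
        last = self._last_delivered.get(alert.key)
        if last is not None and alert.timestamp - last < self.dedup_window_s:
            return None  # noise filtration: repeat of a recently delivered alert
        alert.context.update(self.context_store.get(alert.key, {}))  # add context
        for matches, team in self.routes:
            if matches(alert):
                self._last_delivered[alert.key] = alert.timestamp
                return team, alert
        return None  # no rule matched
```

The suppression window runs from the last alert that was actually delivered, so a continuously repeating alert resurfaces once per window instead of being hidden indefinitely.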

DevOps-based incident response requires DevOps-based incident management. The result will be not only faster and more effective incident response, but a better and more profound application of DevOps principles across your delivery chain.

Try a 14-day free trial or request a free demo to see how VictorOps can help bolster your own team’s DevOps-based on-call incident management and response workflows.

About the author

Michael Churchman started as a scriptwriter, editor, and producer during the anything-goes early years of the game industry. He spent much of the ’90s in the high-pressure bundled software industry, where the move from waterfall to faster release was well underway, and near-continuous release cycles and automated deployment were already de facto standards. During that time he developed a semi-automated system for managing localization in over 15 languages. For the past 10 years, he has been involved in the analysis of software development processes and related engineering management issues. He’s a regular Fixate.io contributor.
